
Their interpretation of the Winter Olympics has warmed the hearts of millions of people

Zhidx (WeChat official account: zhidxcom)

Author | Cheng Qian

Editor | Desert Shadow

"Wu Dajing's last sprint!" In the last bend, Wu Dajing took the lead in rushing out of the bend and crossing the finish line! "On February 5, the first competition day of the Winter Olympics, the Chinese short track speed skating mixed team relay event won the first gold!

Attentive viewers may have noticed a sign language anchor in the lower-right corner of the CCTV Video screen. What is even more remarkable is that this anchor is not a real person but a sign language digital human: Lingyu, the CCTV Video AI sign language interpreter, delivering a wonderful "commentary" to hearing-impaired viewers who know sign language.


▲Lingyu, the CCTV Video AI sign language interpreter, signing Wu Dajing's sprint in the short track speed skating mixed team relay

Looking closely at Lingyu's gestures, we can spot handshapes that look like the familiar numbers "9" and "3". But here they mean something different: the "3" handshape stands for "W" and the "9" handshape for "J", the pinyin initials of "Wu Dajing". It is a striking detail.

Since the Winter Olympics opened, Wang Meng, winner of four Winter Olympic gold medals, has repeatedly topped the trending lists, this time for her rapid-fire, "nagging" commentary style. With golden lines such as "my eyes are rulers" coming one after another, netizens from all walks of life say they have become fans. This shows how much commentary matters in sports. Most commentary, however, is delivered through sound, so some hearing-impaired viewers cannot enjoy it; the arrival of sign language anchors effectively fills that gap.

Tencent's AI sign language interpreter Lingyu has launched on CCTV Video, and Tencent's 3D sign language digital human Xiaocong on Tencent Sports, bringing sign language commentary to hearing-impaired viewers so they can share in the highlights of the Winter Olympics. Xiaocong and Lingyu, jointly created by Tencent's PCG AI Interaction Department and the CSIG Intelligent Platform Product Department, differ from previous 3D AI synthetic anchors: sign language digital humans offer hearing-impaired people "silent communication" through gestures and expressions. Technically, they are built on Tencent's multimodal end-to-end generation model, which performs joint modeling and prediction to generate high-accuracy sequences of actions, expressions, lip movements, and more, achieving sign language that is natural, professional, and easy to understand.

Recently, to uncover the technology behind the sign language digital humans, Zhidx interviewed Meng Fanbo, head of the sign language digital human project team in Tencent's PCG AI Interaction Department, for a detailed look at the difficulties of sign language translation, the technical logic of Tencent's sign language digital humans, and the problems the team ran into during development.

First, three major technical advantages of the sign language anchors: a realistic image and natural, accurate movements

Take a closer look at Xiaocong in the GIF below: doesn't it look remarkably like a real person? During commentary, Xiaocong's head and shoulders also sway slightly with the gestures; the sign language movements are smooth and natural, and details such as expressions and mouth movements are all in place. Achieving these effects depends on AI, big data, and related technologies, which are also the technical challenges behind Tencent's sign language digital humans.


▲Tencent's sign language digital human Xiaocong broadcasting "China wins the first gold"

As you can see, the biggest difference between sign language digital humans and other digital humans is that they make no sound, relying only on elements such as movement and expression. Whether Lingyu or Xiaocong, the image and the sign language movements are impressive. So what technology lies behind them?

1. An ultra-realistic digital human image

For hearing people, sound and tone alone can convey rich meaning. Sign language expresses meaning visually, so it needs larger body movements, a more lifelike character image, and other visual cues to communicate with its audience. The more realistic and approachable the image, the more completely the sign language translation comes across, effectively simulating real sign language broadcasting and further improving the user experience.

To this end, Tencent's sign language digital humans use industry-leading relightable 3D scanning and reconstruction, facial muscle driving, and expression and body gesture capture technologies to build a digital human model that closely reproduces real human skin, with a realistic appearance and natural, vivid movements.

2. Highly understandable sign language expression

Most people may not realize that learning sign language is as hard as learning a foreign language. Sign language is an independent language of the hearing-impaired community, on a par with Chinese, English, and others, with its own grammatical structure, word order, and other rules forming a unique language system. Like Chinese, sign language also has dialects and a "Mandarin": to make sign language more widely usable, the mainland published the National General Sign Language Dictionary in 2019, further standardizing the sign language system.

Tencent's sign language translation system for its digital humans is based on the National General Sign Language Dictionary and has formed a mature process for Chinese-to-sign-language word order conversion and translation. Given spoken-language input, it can generate high-accuracy sign language representations with low latency; through multimodal generation technology it predicts the corresponding hyper-realistic 3D digital human driving parameters in real time and then quickly produces the sign language broadcast video.


▲An example sign language explanation from the National General Sign Language Dictionary app (image from the National General Sign Language Dictionary APP)

In comprehensibility evaluations with hearing-impaired users, the overall comprehensibility of the content broadcast by Tencent's sign language digital humans has exceeded 90%.

3. A highly acceptable sign language presentation

People who don't know sign language, like me, may think it only requires hand movements, but expressions, mouth movements, posture, and more are also key to sign language expression. A vivid example: signing "Understand?" requires coordinated body orientation, expression, eye contact, and mouth shape to convey the questioning tone.

If such a simple question already requires so many elements, how can a sign language digital human accurately convey sentences that carry far more information?


▲An example of interrogative pronouns in sign language from the National General Sign Language Dictionary app (image from the National General Sign Language Dictionary APP)

As a visual language, sign language often requires manual and non-manual information to work together. Beyond the questioning tone above, everyday expression carries many other attitudes such as sighs and affirmations; for sign language to feel authentic, both accurate hand movements and accurate non-manual information are needed.

To achieve more accurate and natural sign language expression, Tencent's PCG AI Interaction Department built a Chinese-to-sign-language translation system that generates sign language representation information through machine translation, and performs joint modeling and prediction with a multimodal end-to-end generation model to produce high-accuracy sequences of actions, expressions, lip movements, and more.

Second, building a sign language system to drive accurate expression by the sign language digital humans

Most people assume sign language movements are fairly simple, with each word having a corresponding gesture; in reality it is genuinely hard. Just as learning English requires breaking up Chinese word order and thinking in English to become fluent, sign language has its own word order, sentence structure, and special expressions that differ from Chinese. Not every word in a sentence needs to be translated into sign language, such as quantifiers and adverbs, yet deciding what can reasonably be dropped is itself a major difficulty.

During their research, the team found that sign language broadcasts have been added to many programs such as Xinwen Lianbo and Beijing News, yet some hearing-impaired viewers say they understand less than 60% of the signed news content.

In the special setting of the Winter Olympics, one can imagine how hard it is to translate sign language vocabulary such as event names and technical moves. To adapt the sign language digital humans to this scene, the researchers put in a great deal of effort.

Meng Fanbo said that, first, they had to train the sign language system to cope with the noisy ambient sound of competition and interview venues, so in the early stage the technical team selected a large number of event reports to train the sign language digital humans. Second, sign language, as an independent language, has very limited text resources: by collecting from many sources, the research team could find only about 1.6 million valid texts. Compared with the roughly 200 million texts available for Chinese and English, this volume is very small.

More importantly, sports events involve many technical terms, and the sign language digital humans must guarantee data accuracy while keeping the information comprehensive and complete. Tencent's AI interaction technology team therefore partnered with professional sign language teachers, and every sign migrated onto the digital humans was repeatedly verified by sign language consultants.

So, facing a highly specialized Winter Olympics with limited text data, how to create a sign language digital human that is "truly understandable" is exactly the technical barrier Tencent's AI interaction technology team had to cross.

1. Sign language has its own word order, so a mapping dictionary was built

Complex sign language movements can be bewildering, but through Zhidx's exchanges with professionals we learned that the word order of sign language differs greatly from Chinese. For example, in sign language expression the word for the goal of the action is signed first and the signer's intention comes last, so the Chinese sentence "I want to go home" is expressed in sign language as "home return I want".

In sign language translation, it is not enough to map each word one-to-one; the order must also be adjusted so that hearing-impaired viewers can follow. Tencent's AI interaction technology team therefore built a mapping dictionary and language system between Chinese and sign language, translating Chinese into sign language that conforms to natural sign language norms and the expression habits of hearing-impaired people, as sketched below.
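To make the word-order idea concrete, here is a minimal Python sketch of a Chinese-to-sign-gloss mapping with a toy reordering rule that reproduces the "home return I want" example above. The dictionary entries and the reordering heuristic are illustrative assumptions, not Tencent's actual lexicon or grammar rules.

```python
# Minimal sketch of a Chinese-to-sign-gloss mapping dictionary plus a toy
# word-order rewrite. Entries and rules are simplified for illustration.

# Hypothetical mapping from segmented Chinese words to sign language glosses.
GLOSS_MAP = {
    "我": "I",
    "想": "WANT",
    "回": "RETURN",
    "家": "HOME",
}

def to_gloss_sequence(words):
    """Map segmented Chinese words to glosses, dropping words with no sign."""
    return [GLOSS_MAP[w] for w in words if w in GLOSS_MAP]

def reorder_for_sign(glosses):
    """Toy rule: goal of the action first, subject and modal verb last,
    mimicking 'I want to go home' -> 'HOME RETURN I WANT'."""
    subject = [g for g in glosses if g == "I"]
    modal = [g for g in glosses if g == "WANT"]
    rest = [g for g in glosses if g not in ("I", "WANT")]
    return rest[::-1] + subject + modal   # HOME RETURN, then I, then WANT

if __name__ == "__main__":
    segmented = ["我", "想", "回", "家"]           # "I want to go home"
    print(reorder_for_sign(to_gloss_sequence(segmented)))
    # ['HOME', 'RETURN', 'I', 'WANT']
```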


2. Building a sign language system framework and pruning quantifiers as needed

In sign language, personal names are expressed with pinyin, but the Winter Olympics, as an international sporting event, features many foreign athletes whose names are more complex than the pinyin of Chinese names. If every name were spelled out sign by sign, the interview might be over before the signing finished.

On the premise of fully expressing the meaning of a sentence, Tencent's AI interaction technology team uses intelligent summarization, upgrading chapter-level summarization to sentence-level compression: it streamlines the ASR-recognized text, captures the key information, and omits words such as quantifiers and degree adverbs. For example, the commentary "Looking at the slow motion, it can be seen that Gu Ailing's height is higher than the other athletes, very ethereal, very good-looking" can be compressed to "Gu Ailing's height is higher than the other athletes, very ethereal and good-looking", reducing the text to about 60% of the original commentary's length. This ability to cut properly while keeping the summary complete is a key prerequisite for sign language expression.
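As a rough illustration of sentence-level compression, the following Python sketch strips a few filler phrases and degree adverbs from a commentary sentence and reports the resulting compression ratio. The stop lists are simplified stand-ins; the production system described above relies on learned intelligent summarization rather than hand-written rules.

```python
# Toy rule-based sentence compression: drop filler phrases and degree
# adverbs so the remaining text is shorter but keeps the key information.
# Word lists are illustrative only.

DEGREE_ADVERBS = {"very", "really", "quite", "extremely"}
FILLER_PHRASES = ["Looking at the slow motion,", "it can be seen that"]

def compress(sentence: str) -> str:
    for phrase in FILLER_PHRASES:
        sentence = sentence.replace(phrase, "")
    words = [w for w in sentence.split()
             if w.lower().strip(",.") not in DEGREE_ADVERBS]
    return " ".join(words).strip()

text = ("Looking at the slow motion, it can be seen that Gu Ailing's height "
        "is higher than the other athletes, very ethereal, very good-looking.")
short = compress(text)
print(short)
print(f"compression ratio: {len(short) / len(text):.0%}")
```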

Guided by a team of sign language consultants, Tencent's AI interaction technology team surveyed existing sign language research, built a basic sign language system framework, and developed a sign language translation system: simply by inputting spoken-language text, it can generate high-accuracy sign language representations through machine translation.

In addition, to keep the duration of the sign language video consistent with the original video, the digital human's translation process dynamically controls the Chinese sentences, compressing the text according to the available time, sentence meaning, and other factors before generating the corresponding sign language video.

Meng Fanbo said: "In video and audio processing we have done fault-tolerant alignment, and the delay of the live translation pipeline is controlled within an acceptable range. To keep sign language video processing stable and the audience experience consistent across downstream links, we have also smoothed the audio transmission and recognition inputs. At present the compression ratio from Chinese to sign language is about 60%, and it is adjusted according to the actual situation."
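One way to picture this duration-driven compression is to derive a per-sentence compression target from the source segment's length and an assumed signing speed. In the sketch below, the 60% default ratio comes from the figure quoted above, while the signing speed and the capping rule are assumptions for illustration only.

```python
# Sketch of duration-aware compression targeting: estimate how many glosses
# fit in the source segment at an assumed signing speed, and cap the
# per-sentence compression ratio accordingly.

DEFAULT_RATIO = 0.6        # roughly 60% Chinese-to-sign compression (from the article)
SIGNS_PER_SECOND = 1.5     # assumed average signing speed

def target_gloss_count(segment_seconds: float) -> int:
    """How many glosses can be signed within the source segment."""
    return int(segment_seconds * SIGNS_PER_SECOND)

def compression_ratio(n_source_words: int, segment_seconds: float) -> float:
    """Ratio of glosses to source words, capped so signing never runs
    longer than the original audio segment."""
    budget = target_gloss_count(segment_seconds)
    return min(DEFAULT_RATIO, budget / max(n_source_words, 1))

# Example: a 6-second commentary segment containing 20 Chinese words.
print(compression_ratio(20, 6.0))   # 0.45 -> compress more aggressively than 60%
```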

3. Integrating manual and non-manual information, with comprehensibility above 90%

Part of the magic of Chinese is that the same sentence can mean completely different things depending on intonation. In sign language, then, how the same sentence conveys the speaker's different emotions, and how richer expressions, gestures, and postures can accurately convey its meaning, is another technical difficulty in building a sign language digital human.

Sign language needs a combination of elements to convey complete meaning to hearing-impaired viewers. Based on Tencent's multimodal end-to-end generation model, the researchers extracted multimodal information within the sign language system, such as gesture vocabulary, expressions and mouth movements, body rhythm, and word-order prosody, synchronized sign language movements with facial expressions, and further optimized sign language expression.

With this technology, the AI sign language achieves comprehensibility of more than 90%.
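The following is a minimal PyTorch-style sketch of what "joint modeling and prediction" across modalities can look like: one shared encoder over the gloss sequence feeds separate heads that emit frame-aligned body-pose, facial-expression, and lip-movement parameters. The architecture, layer sizes, and parameter dimensions are assumptions for illustration and are not Tencent's actual model.

```python
# Joint multimodal prediction sketch: a shared Transformer encoding drives
# three synchronized output heads (pose, face, lips). Dimensions are assumed.

import torch
import torch.nn as nn

class SignMotionGenerator(nn.Module):
    def __init__(self, vocab_size=2000, d_model=256,
                 pose_dim=63, face_dim=52, lip_dim=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Separate heads share the same encoding so the streams stay in sync.
        self.pose_head = nn.Linear(d_model, pose_dim)
        self.face_head = nn.Linear(d_model, face_dim)
        self.lip_head = nn.Linear(d_model, lip_dim)

    def forward(self, gloss_ids):
        h = self.encoder(self.embed(gloss_ids))   # (batch, seq, d_model)
        return self.pose_head(h), self.face_head(h), self.lip_head(h)

model = SignMotionGenerator()
gloss_ids = torch.randint(0, 2000, (1, 8))        # one sequence of 8 glosses
pose, face, lip = model(gloss_ids)
print(pose.shape, face.shape, lip.shape)           # per-gloss, aligned outputs
```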

Third, building a visual action editing platform to generate sign language videos with low latency

The technologies above make the sign language digital humans truly understandable, but how can they genuinely benefit hearing-impaired people and be applied effectively to news broadcasts? To that end, Tencent's AI interaction technology team built a visual action editing platform to support large-scale application.

Built on a complete sign language translation system and a mature PaaS system, the visual action editing platform enables rapid, low-latency translation, delivering sign language within seconds while guaranteeing semantic completeness and accuracy.

On making the sign language digital humans truly usable, Meng Fanbo said: "The Winter Olympics scenario is only our first step. In the future we will consider applications in both real-time and non-real-time scenarios to cover the different needs of hearing-impaired people."

1. Generating sign language videos with low latency

The power of the visual action editing platform lies in quickly generating sign language videos from Chinese text and video files. In this link the conversion and translation take very little time: quite possibly, the moment you hear the news broadcast, the sign language digital human has already conveyed the same content in full.

So how does the system actually generate a sign language video? A piece of text or a video is input and preprocessed; the content processing includes multimodal video content extraction, speech extraction, intelligent timestamp alignment, OCR extraction of embedded subtitles, and so on. Sign language translation elements are then generated, including gestures, body movements, expressions, and lip movements, further ensuring the accuracy of word order conversion, expression, posture, and other features; driven by the hyper-realistic digital human, the corresponding sign language video is produced quickly.
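A schematic Python sketch of this flow is shown below: the input is preprocessed into timestamped sentences, each sentence is translated into sign language elements, and a rendering step stands in for driving the digital human. Every function and class here is a hypothetical placeholder for a production component, not a real Tencent API.

```python
# Schematic offline generation flow: preprocess (ASR / subtitle OCR /
# timestamping), translate to sign elements, then render. All placeholders.

from dataclasses import dataclass

@dataclass
class SignElements:
    glosses: list        # gesture/limb vocabulary
    expressions: list    # facial expression cues
    lip_movements: list  # mouth shapes per gloss

def extract_text(source) -> list[tuple[float, str]]:
    """Return (timestamp, sentence) pairs from text, ASR, or subtitle OCR."""
    if isinstance(source, str):                  # plain text input
        return [(0.0, source)]
    raise NotImplementedError("video input: run ASR + subtitle OCR here")

def translate_to_sign(sentence: str) -> SignElements:
    """Placeholder for the Chinese-to-sign translation step."""
    glosses = sentence.split()                   # stand-in for real MT
    return SignElements(glosses, ["neutral"] * len(glosses), glosses)

def render(elements: SignElements) -> str:
    """Placeholder for driving the 3D digital human and encoding video."""
    return f"<video: {len(elements.glosses)} signed glosses>"

def generate_sign_video(source) -> list[str]:
    return [render(translate_to_sign(s)) for _, s in extract_text(source)]

print(generate_sign_video("China wins first gold"))
```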


2. Handling both scripted and unscripted scenarios

At present most TV programs have subtitles, but some live programs and radio programs may have only sound and no subtitles. Tencent's sign language digital humans can cope with this too: they can not only extract text information but also recognize audio and video.

In scenarios such as real-time news, to further promote barrier-free access to information and convey more of it to hearing-impaired people through the sign language digital humans, Tencent's visual action editing platform handles both unscripted and scripted scenarios and supports adding sign language interpretation to live programs in the form of a video stream.

After the program source is fed in, the visual action editing platform extracts the audio and video streams, extracts the text for sign language translation, quickly generates the sign language video, then encodes it, transmits the video stream, and composites it with the program video to form a video stream for the live broadcast.
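The live path can be pictured as a simple loop, sketched below: pull audio chunks from the program stream, recognize the speech, translate and render a sign language segment, and hand it to a compositor that merges it back into the broadcast feed. The stream URL and all functions are hypothetical placeholders rather than a real streaming interface.

```python
# Conceptual live-stream loop: chunked audio -> ASR -> sign translation and
# rendering -> compositing into the broadcast. All names are placeholders.

import time

def audio_chunks(stream_url, chunk_seconds=2.0, max_chunks=3):
    """Yield fixed-length audio chunks from the live source (dummy data here)."""
    for _ in range(max_chunks):          # a real source would run until the stream ends
        yield b"\x00" * 16000            # placeholder for one chunk of PCM audio
        time.sleep(0.1)                  # a real loop would pace at chunk_seconds

def recognize(chunk) -> str:
    return "placeholder transcript"      # stand-in for streaming ASR

def translate_and_render(text) -> bytes:
    return text.encode()                 # stand-in for sign translation + rendering

def run_live(stream_url, compositor):
    for chunk in audio_chunks(stream_url):
        text = recognize(chunk)
        sign_segment = translate_and_render(text)
        compositor(sign_segment)          # mux the sign video back into the broadcast feed

run_live("rtmp://example.invalid/live", compositor=print)   # hypothetical source URL
```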


3. Quickly learning and updating hot words

More and more hot words and new coinages appear in our daily communication, and the same word can mean something completely different on the internet; many hearing-impaired people naturally follow these trends too. Such words now show up frequently in many videos, which poses a further challenge for sign language broadcasting.

Tencent's sign language digital humans can learn on their own and quickly add large numbers of new words and hot words, and the researchers specially compiled and optimized the sign language vocabulary for Winter Olympic sports. The sign language digital humans now have a complete set of sign language skills for sports commentary.

On iterating the sign language vocabulary, Meng Fanbo revealed that the team built a visual action editing platform for the sign language digital humans that can edit and generate sign language actions in batches without motion-capturing each word, greatly improving the efficiency of producing sign language vocabulary.

Tencent's AI interaction technology team has worked on digital human technology for many years. Its big data platform can feed high-frequency Chinese text into the pre-trained model while dynamically loading and retrieving data, annotate the signing of newly collected new words and hot words, and, together with the backend, predict the signing of out-of-vocabulary (OOV) words according to their type, ensuring the coherence of the final output. A sketch of this idea follows.
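Below is a toy sketch of type-aware OOV handling: words found in the sign lexicon are signed directly, while an unknown person name falls back to fingerspelling its pinyin initials, echoing the "W"/"J" handshapes used for Wu Dajing earlier in the article. The lexicon, pinyin table, and gloss labels are illustrative assumptions.

```python
# Toy OOV handling: lexicon lookup first, then a type-based fallback
# (fingerspell pinyin initials for person names). Data is illustrative only.

SIGN_LEXICON = {"gold medal": ["GOLD", "MEDAL"], "sprint": ["SPRINT"]}

# Hypothetical pinyin initials for a few name syllables.
PINYIN_INITIALS = {"wu": "W", "da": "D", "jing": "J"}

def fingerspell_name(pinyin_syllables):
    """Fall back to fingerspelling the pinyin initials of a person's name."""
    return [f"FS-{PINYIN_INITIALS[s]}" for s in pinyin_syllables if s in PINYIN_INITIALS]

def to_signs(token, token_type="common", pinyin=None):
    if token in SIGN_LEXICON:
        return SIGN_LEXICON[token]
    if token_type == "person_name" and pinyin:
        return fingerspell_name(pinyin)
    return [f"UNK-{token.upper()}"]   # flag for later review / lexicon update

print(to_signs("sprint"))                                           # ['SPRINT']
print(to_signs("Wu Dajing", "person_name", ["wu", "da", "jing"]))   # ['FS-W', 'FS-D', 'FS-J']
```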

Conclusion: Tencent's sign language digital humans help make information dissemination barrier-free

As a leading company in hyper-realistic 3D digital humans in mainland China, Tencent's AI interaction technology team is focusing on hearing-impaired people and pushing toward higher technical barriers. Tencent continues to improve digital human technology, providing new output channels for industries with strong demand for content broadcasting and narrowing the distance between humans and machines.

Tencent's AI sign language anchor system not only completes sign language translation steps such as word order construction and expression generation, but also relies on an ultra-realistic digital human to output sign language videos with low latency. Viewers see only the final sign language video, but the technical system behind it is enormous, and that is precisely the barrier to developing sign language digital human technology.

With the rapid development of technology, Tencent has been thinking about how to use it to narrow the distance between 27 million hearing-impaired people and society. Launching the sign language digital humans Lingyu and Xiaocong at the important moment of the Winter Olympics brings them to the attention of more of the users who need them. Meanwhile, Meng Fanbo said that Tencent is continuously optimizing related functions around the Winter Olympics scenario to support more scenarios. In the future, Tencent's sign language digital humans will provide services in more scenarios, exploring offline scenes such as daily-life services and culture and tourism beyond news reporting, upholding "tech for good" and helping to create a barrier-free environment for information dissemination.
