
Tsinghua Big Data Forum: Zheng Wen, vice president of AI technology of Kuaishou, shares deep learning applications

Author: China News Network

On April 27, on the occasion of Tsinghua University's 108th anniversary, the Tsinghua University Big Data Research Center and the Tsinghua-Kuaishou Future Media Data Joint Research Institute co-hosted the "Tsinghua Big Data Forum - Deep Learning Technology and Application," where teachers, students, and alumni of Tsinghua University gathered to discuss and share the latest progress in deep learning technology and its applications.


Dr. Zheng Wen, Vice President of AI Technology at Kuaishou, delivering the keynote speech

It is reported that the Tsinghua University-Kuaishou Future Media Data Joint Research Institute was officially established in April 2018. As a university-level research institution of Tsinghua University, the Institute draws on Tsinghua's leading technology and Kuaishou's years of industry experience to carry out basic and applied research, development, integration, and rapid iteration across many fields, jointly exploring a series of future media topics so that technology can better empower users and connect people more accurately.

Dr. Zheng Wen, a 2001 alumnus of the School of Software, vice president of the Tsinghua-Kuaishou Future Media Data Joint Research Institute, and vice president of AI technology at Kuaishou, spoke on "Application and Prospect of Deep Learning in the Field of Short Video."

Zheng Wen said that as a short video app with more than 160 million daily active users, Kuaishou's mission is to "use technology to enhance everyone's unique happiness." The mission contains two key words. One is "everyone," which shows that Kuaishou's values are highly inclusive; at the same time, it emphasizes that each person's happiness is "unique." Serving everyone is difficult to achieve through manual operations alone; it requires artificial intelligence, especially the deep learning technology that has made breakthroughs in recent years.

Zheng Wen said that at present Kuaishou enhances happiness through recording, which is reflected in two aspects. First, users want to see a wider world. Second, users also have a need to share themselves and be seen by that wider world.

But there is a challenge: Kuaishou has now accumulated more than 8 billion videos and hundreds of millions of users. Faced with these two massive numbers, how can attention be distributed effectively? In the past, attention generally concentrated on so-called "blockbuster" videos, but beneath them lies a large amount of content with rich information and diverse categories. Such "long-tail" videos are rarely noticed, so groups with niche needs or more segmented interests often struggle to find the content they want.


This challenge means that content matching and distribution must rely on deep learning-based AI technology rather than manual work. Kuaishou invested in AI-related technologies early on, and deep learning is applied extensively in every link from video production to distribution.

Content production

Zheng Wen said that Kuaishou hopes to make recording richer and more interesting through AI. Toward this goal, it has developed a large number of multimedia and AI technologies, such as background segmentation, sky segmentation, hair segmentation, human-body key points, face key points, and gesture key-point detection, and applied them in "magic expression" effects.

The distribution of Kuaishou users closely matches that of Chinese Internet users as a whole, and a large share of Chinese Internet users are on low-end phones with limited computing power. To bring advanced technology to as many users as possible, Kuaishou customized the underlying platform around its self-developed YCNN deep learning inference engine and media engine, so that the technologies above run efficiently on most models, with adaptations and optimizations for different phones and hardware.

Zheng Wen revealed that Kuaishou also wants content quality to be higher and has developed and applied a range of image enhancement technologies. For example, when a user shoots in a very low-light environment, the resulting video often loses information and detail, which low-light enhancement technology can recover.

Next came some specific deep learning technologies Kuaishou has recently developed for content production. Three-dimensional face technology can recover the 3D structure of a face from a single image. On the one hand, this enables modifications to the face, such as relighting, adding expressions, and 3D face-swap effects. On the other hand, the 3D information makes it possible to extract changes in a person's expression and transfer them to a virtual cartoon character, similar to the Animoji feature on the iPhone X. But the iPhone X has a structured-light camera, and running Animoji requires very strong computing power; through its own R&D, Kuaishou can achieve similar functions with ordinary cameras on lower-end phones.

Zheng Wen said that portrait segmentation technology can separate the subject from the background, apply effects to either, replace the background, or add background bokeh. Hair segmentation delineates the hair region for hair-coloring effects, and sky segmentation can make the sky look more surreal and dreamlike. Human pose estimation predicts the positions of a person's joints; with it, special effects can be attached to the limbs, or the body shape can be modified for a slimming feature. It can also reconstruct the body's 3D structure and use it to drive a cartoon character.
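The compositing step behind background replacement can be sketched as simple alpha blending with a soft segmentation mask (a minimal illustration, not Kuaishou's implementation; in practice the mask would come from a segmentation model):

```python
import numpy as np

def replace_background(frame, background, mask):
    """Composite a frame over a new background using a soft
    segmentation mask with values in [0, 1] (1 = person)."""
    alpha = mask[..., None]  # broadcast the mask over the RGB channels
    return (alpha * frame + (1.0 - alpha) * background).astype(frame.dtype)

# Toy 2x2 RGB example: the left column is "person", the right is background.
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
background = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
out = replace_background(frame, background, mask)
```

With a soft (fractional) mask the same code blends person and background smoothly at the boundary, which is what makes segmentation-based effects look natural.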

Gesture detection recognizes a variety of specific hand shapes to enable gameplay such as "rain control." There is also AR camera pose estimation, backed by Kuaishou's self-developed 3D engine, on top of which editor, rendering, limb, and sound modules are built to give models refined, natural lighting and materials.

Many intelligent algorithms are also applied to audio and video: video needs to be as clear as possible while still streaming smoothly, which requires adaptive optimization based on video complexity. The image is analyzed as well; for example, the face region usually affects perceived quality the most, so it is detected and given a higher bit rate, which noticeably improves the overall look.
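The face-weighted bit allocation can be sketched as a per-block quality map (the grid size and QP offsets below are made up for illustration; real encoders express this as quantization-parameter offsets per macroblock):

```python
import numpy as np

# Hypothetical 4x4 grid of macroblocks; True marks blocks covered by a face.
face_mask = np.zeros((4, 4), dtype=bool)
face_mask[1:3, 1:3] = True  # detected face region in the center

base_qp = 30  # lower QP = more bits spent = higher quality
qp_map = np.where(face_mask, base_qp - 4, base_qp + 2)
```

Face blocks get a lower QP (more bits), non-face blocks a slightly higher one, so the total bit budget stays roughly constant while perceived quality improves.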

Image quality is also checked, for example for factors in the production process that degrade it: shooting out of focus, a lens that has not been wiped for a long time, or blocky artifacts after a video has been uploaded and compressed multiple times. AI algorithms detect these problems, both to remind users to pay attention to them when shooting and to favor high-quality videos when making recommendations.
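One classic heuristic for the out-of-focus check is the variance of the Laplacian response: blurry images have little high-frequency detail, so the variance is low. This is a sketch of the general idea only; the article does not say which detector Kuaishou actually uses:

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the 3x3 Laplacian response -- a classic sharpness
    heuristic: low variance suggests a blurry / out-of-focus image."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def box_blur(img):
    """3x3 box filter via shifted slices (crops a 1-pixel border)."""
    h, w = img.shape[0] - 2, img.shape[1] - 2
    return sum(img[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))   # high-frequency "texture"
blurry = box_blur(sharp)       # smoothed version of the same image
```

Thresholding the score then separates usable frames from defocused ones.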

Content understanding

According to Zheng Wen, once production is complete, the video is uploaded to back-end servers, where a deeper understanding of its content is required. Video content understanding serves many purposes, such as content safety, originality protection, recommendation, search, and advertising, and it roughly falls into two stages.

The first is the perception stage, where the machine understands the video along four dimensions: face, image, music, and speech.

The face is a very important dimension, because it often carries the part of a person that viewers care about most; the face region is detected to identify age, gender, expression, and so on.

Another dimension is the image level: classifying the image, such as identifying the scene, detecting what objects appear, evaluating image quality, and extracting text with OCR.

Music is an important part of a video's appeal. The type of music can be identified from the video, and the music can even be analyzed structurally, separating the accompaniment from the vocals.

Speech is also a very important dimension of video. The images alone often fail to convey all the information in a video, so speech matters: it is recognized and transcribed into text, and the voice can also be used to identify the speaker's identity, age, gender, and so on.

The second stage is the reasoning stage, where information from the different dimensions is fused across modalities to infer higher-level semantic information or to recognize the emotion of the video. Knowledge graph technology is also used to store the knowledge in videos and express it as a graph; reasoning over the knowledge graph yields higher-level, deeper information.
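The fusion step can be sketched as late fusion: concatenate the per-modality embeddings and score them with a linear classifier. All the dimensions, the class count, and the random weights below are made up for illustration; the article does not describe the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for one video (dimensions invented).
face_vec   = rng.standard_normal(64)
image_vec  = rng.standard_normal(128)
music_vec  = rng.standard_normal(32)
speech_vec = rng.standard_normal(64)

# Late fusion by concatenation, followed by a linear scoring layer.
fused = np.concatenate([face_vec, image_vec, music_vec, speech_vec])  # 288-d
W = rng.standard_normal((10, fused.size))  # 10 hypothetical content classes
logits = W @ fused

# Softmax over the class scores.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(np.argmax(probs))
```

In a trained system `W` would be learned, and the fused vector could feed emotion recognition or tagging heads in the same way.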

More specific content understanding technologies have also been built. For example, Kuaishou has developed a video tagging system that can classify most of the content and scenes appearing in a video. Its speech recognition module uses deep learning algorithms combined with a context module, which greatly improves recognition accuracy.

Understanding the video is one side; understanding the user is the other, including information the user discloses, such as age and gender, and behavioral data generated in real time while using Kuaishou. This data feeds back-end deep learning models that train vectors representing the user. With these vectors, it is possible to predict the user's interests and their relationships with other users.
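A minimal sketch of interest prediction with such vectors, assuming a dot-product match between user and video embeddings (the actual model is not described in the article; the embeddings here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learned embeddings: one row per user, one per video.
user_emb  = rng.standard_normal((3, 16))   # 3 users, 16-d vectors
video_emb = rng.standard_normal((5, 16))   # 5 candidate videos

# Predicted interest = sigmoid of the user-video dot product.
scores = 1.0 / (1.0 + np.exp(-(user_emb @ video_emb.T)))  # shape (3, 5)

# Rank the candidates for user 0 by predicted interest, best first.
ranking_for_user0 = np.argsort(-scores[0])
```

The same dot-product trick between two user vectors gives a user-user similarity, which is one way to model relationships between users.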

Finally, matching the user profile against the video understanding produces trillion-scale feature data, which the real-time online recommendation system uses to predict which videos a user will find interesting. The content in the community is also ranked: as mentioned earlier with attention allocation, Kuaishou hopes the gap in attention distribution will not be too large, so the distribution of video content is adjusted according to the Gini coefficient. Factors such as content safety, diversity, and originality protection are also taken into account.
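The Gini coefficient mentioned here measures how unevenly attention (for example, video views) is distributed: 0 means perfectly even, and values approaching 1 mean a few blockbusters dominate. A minimal sketch of the computation (the view counts are invented):

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative distribution
    (0 = perfectly even, -> 1 = highly concentrated)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formula via the normalized cumulative sum (Lorenz curve).
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

even   = [100, 100, 100, 100]   # attention spread evenly across 4 videos
skewed = [1, 1, 1, 997]         # one "blockbuster" takes almost everything
```

A distribution system can monitor this number and shift exposure toward long-tail videos whenever it drifts too high.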

Zheng Wen said that he hopes to further deepen cooperation with university teachers and students, making full use of Kuaishou's massive data and powerful computing to advance deep learning technology together, explore more possibilities for the future, and enhance public happiness, which is also the vision behind the founding of the Tsinghua University-Kuaishou Future Media Data Joint Research Institute.
