Tencent Ye Cong: The computer vision technology and applications behind viral campaigns in WeChat Moments

Speaker: Ye Cong, Tencent technical expert

Editor: Zhang Zhiyue

Content source: DataFun AI Talk "Intelligent Technology Frontier Practice Sharing"

Production community: DataFun

Introduction: This talk systematically introduces the basics of computer vision, how to use recognition algorithms to build an application, and how to deploy and promote the whole pipeline. It covers the following six parts:

1. The secret behind viral campaigns in WeChat Moments;

2. Fundamentals of computer vision;

3. Image processing methods of the past - traditional methods;

4. The explosion of image processing - deep learning methods;

5. Analysis of the supporting cloud AI capabilities;

6. Skill advancement.

The secret behind viral campaigns in WeChat Moments

The figure below shows the May Fourth Youth Day campaign, a face matching game built on a face recognition algorithm. Users upload a photo of themselves, are matched with a figure from the Republic of China period, and can share the result in a fun way. A highly scalable cloud architecture was adopted to run this service.


So what fundamentals are needed to build a campaign like the one just shown? Let's take a closer look.

Fundamentals of Computer Vision

1. Computer Vision Definition

Computer vision is the study of how to obtain high-level, abstract information from images and videos. From an engineering perspective, computer vision automates tasks that the human visual system can perform. Its branches include Instance Recognition, Object Detection, Semantic Segmentation, Motion & Tracking, 3D Reconstruction, Visual Question Answering, Action Recognition, and so on.

As computer vision matures, it is disrupting more and more fields. Broadly, anything that used to be identified by the human eye or by traditional methods will gradually be changed by computer vision. The picture on the left shows the familiar case of face recognition, such as paying by face or entering a park by face scan. Scanning a face is really recognition: the face is matched against stored faces using its feature points to determine who the person is.

The second is the currently very popular field of driverless cars. This is a more complex, real-world task that can be solved in different ways, as discussed in more detail later.

The third is semantic segmentation. When humans look at the natural world, the image formed on the retina lets us perceive different colors. Machines use RGB-alpha to represent colors. RGB stands for the three primary colors red, green and blue. Generally speaking, true color is 32-bit color: RGB takes 24 bits, and the remaining 8 bits are the alpha channel, which represents how transparent a pixel is.

Of the three pictures on the right, the top one is a grayscale image, which has no color. The second is a full-color image with only the RGB channels and no alpha (transparency) channel. The last one is a true color image, which has an alpha channel, for a total of 32 bits.
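The 32-bit layout described above can be illustrated with a few lines of Python. This is only a sketch: the byte order (red in the highest byte, alpha in the lowest) and the function names are assumptions made for illustration, and the grayscale conversion uses the commonly cited ITU-R BT.601 weights rather than anything specified in the talk.

```python
def pack_rgba(r, g, b, a=255):
    """Pack four 8-bit channels into one 32-bit integer.

    Assumed layout (hypothetical, for illustration): R G B A from high byte to low byte.
    """
    for value in (r, g, b, a):
        if not 0 <= value <= 255:
            raise ValueError("each channel must fit in 8 bits (0-255)")
    return (r << 24) | (g << 16) | (b << 8) | a


def unpack_rgba(pixel):
    """Split a 32-bit pixel back into its (r, g, b, a) channels."""
    return ((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF,
            (pixel >> 8) & 0xFF, pixel & 0xFF)


def to_gray(r, g, b):
    """Collapse RGB into one grayscale value using the BT.601 luma weights."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


half_transparent_red = pack_rgba(255, 0, 0, 128)
print(hex(half_transparent_red))          # 24 bits of color plus 8 bits of alpha
print(unpack_rgba(half_transparent_red))  # the original four channels
```

A 24-bit full-color image simply omits the last byte, and a grayscale image keeps a single value per pixel.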

2. Computer vision imaging

What we often have to deal with are more complex images, such as aerial photographs, thermal images, X-rays, CT scans, and molecular or cell micrographs, which are processed using a variety of filters.

3. Levels of computer vision processing

To better understand computer vision processing, it is divided into three levels: low, mid, and high. Low level processing is generally more detail-oriented, such as noise reduction, optimization, compression, and edge detection. Mid level processing includes classification, segmentation, object detection, verification, semantic segmentation, and so on. High level processing is higher-dimensional and more macroscopic, including scene understanding, face recognition, driverless cars, multimodal problems, and so on.

Low Level Processing

On the left of the figure below is a chest X-ray. In the original image at the upper left, the bones and blood vessels are hard to see; the image at the lower left has been enhanced, and the bones and blood vessels are clearly visible.

In the upper middle is an image of a printed circuit board (PCB). The original image contains a lot of noise; after denoising, the image becomes much smoother and is ready for the next processing step.

In the lower middle is an aerial photograph. The entire picture is washed out by haze or fog. If processing such as target recognition were applied directly, the results would be very poor. So the image is first enhanced to improve its contrast, and further processing is done only after the image has become clear.

The picture on the right shows correction (registration), which aligns pictures taken from different angles.
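Two of the low level steps above, denoising and contrast enhancement, can be sketched in plain Python. The talk does not say which filters were used in the examples, so the 3x3 median filter and the linear contrast stretch below are stand-ins chosen for illustration, operating on a grayscale image stored as a list of rows.

```python
from statistics import median


def median_filter_3x3(image):
    """Replace each interior pixel with the median of its 3x3 neighborhood.

    Border pixels are copied unchanged to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


def stretch_contrast(image, out_min=0, out_max=255):
    """Linearly rescale pixel values so they span the full output range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # a perfectly flat image has no contrast to stretch
        return [row[:] for row in image]
    scale = (out_max - out_min) / (hi - lo)
    return [[round(out_min + (p - lo) * scale) for p in row] for row in image]


noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]   # one "salt" pixel
hazy = [[200, 210], [220, 230]]                       # values crowded near white
print(median_filter_3x3(noisy))
print(stretch_contrast(hazy))
```

The median filter removes the isolated bright pixel, and the stretch spreads the narrow 200 to 230 range of the hazy image over the full 0 to 255 range.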

Mid Level Processing

The following figure is borrowed from a course by Professor Fei-Fei Li of Stanford University. Take kittens and puppies as an example. Classification decides which category an image belongs to. Knowing the class of an object and also locating its position in the image is single object detection (classification plus localization). If the picture contains many different objects, such as kittens, puppies, and ducks, the task is called object detection, which identifies all the objects in the picture. The boxes in the picture distinguish the different objects, but they give only the approximate position of each object; to be accurate down to the pixel, the objects must be segmented (instance segmentation).

The figure on the right shows a higher level of processing: semantic segmentation, or scene segmentation. The concept of semantic segmentation exists in many fields. In NLP it generally refers to splitting a sentence into its morpheme components, from the perspective of the text. In the image field it means separating the different elements in a picture. In the figure on the right, for example, the road is gray, the pedestrians are red, the plants are green, and the cars are blue, so that all objects of the same class are precisely marked out by color.

High Level Processing

At present, most of the company's research effort is focused on the high level field. Because high level problems are more macroscopic, the problems solved there directly affect everyone. In the upper left of the figure below, an algorithm extracts feature points from a face and matches them against features already stored in a database to identify who the person is. Because the picture stored in the library and the picture to be recognized are not necessarily taken from the same angle, and the lighting may also differ, this is a process of fuzzy matching.

At the bottom left is driverless driving. The industry currently has two approaches to it: one uses lidar; the other captures video, supplemented by sensors such as sonar and infrared sensors. Neither approach is necessarily better than the other. Because lidar is very expensive, the whole vehicle becomes expensive, which means lower sales and therefore less collected data. Machine learning, however, relies heavily on collecting high-quality labeled data, which creates a dilemma. Therefore, some companies use only image recognition and try to approach the performance of lidar, so that they can cut costs and collect more data.

In the middle is scene understanding. Two children are playing ball; they face different directions, dress differently, and differ in some subtle movements. Scene understanding requires identifying each object in the picture, including what a person is wearing and what he is holding. We then have to infer his intention, such as playing ball or walking. On the right is the colored tree produced by recognition. Different colors represent the different subjects and objects, as well as their modifiers. In this way, we extract all the information in the picture: not only who is in it, but also what he is doing, which lets us make predictions.

The right panel is a 3D model of blood vessels. Before a doctor performs cardiovascular surgery, the blood vessels are reconstructed in 3D from scan data. By looking at the 3D model before surgery, the doctor can tell how thick each blood vessel is and where there may be risks. This greatly reduces the risk of surgery. These are all projects that have already been deployed.

There are other, more common examples, including multi-face recognition: when we take pictures, the camera detects the faces and focuses on them. The middle figure is text recognition (OCR). It shows text being scanned with a laser pen and then recognized, which is a very old way of doing text recognition. Now that OCR technology is quite mature, a reasonably clear photo is basically enough for all of the text to be recognized. On the right is license plate recognition, a very commonly used technique in China.

4. Target tracking

Target tracking is a very challenging and very promising topic. The following figure shows an example from an NBA video that tracks the position of a player on the court. In the middle, the player's body changes shape as he moves; on the right, the player's whole appearance changes because of strong lighting. The problem in the image below is that if the camera does not capture a fast-moving target quickly enough, the target blurs, and if the background color is very close to the foreground color, there is interference. These problems may occur at the same time during tracking. No current method solves them perfectly, and new methods are constantly emerging.

Various problems can also occur even when tracking a face that stays in place: the face may rotate, it may be partly inside and partly outside the frame, and it may be occluded. At present, the better way to do target tracking is to use different algorithms in different scenarios.

5. Multimodal problems

Multimodal problems are problems that can only be solved by combining computer vision, NLP, and speech recognition. One example is visual question answering. In the picture of the baby below, the question is where the baby is sitting. First, scene recognition must be performed on the picture to understand what it contains, such as where the baby is sitting. In addition, an NLP engine must understand the question, which asks where the child is sitting, not what he is doing. Finally, the understanding of the user's question is matched against the elements in the image. So this example involves three components, which makes it a typical multimodal problem.

Another example of a multimodal problem is understanding the scene in a picture and generating a text description of it. First, we must understand what is in the picture; second, we must be able to generate descriptive sentences and a story from the elements extracted from the picture. This involves at least two models. In addition, training the description generator requires different training sets, which makes the problem more complicated.

Image processing methods of the past - traditional methods

Traditional image processing methods include filtering, feature design, classification, segmentation, and object detection. Commonly used filtering methods include spatial filters, Fourier filters, and wavelet filters; feature design methods include SIFT and HOG; classification methods include SVM, AdaBoost, and Bayesian classifiers; segmentation and object detection methods include watershed, level set, and the active shape model.

1. Feature design - edge detection

To recognize an image, the first step is to let the machine read some of its features. This is why image feature design is required. The feature that most readily comes to mind is the edge. In the following figure, for example, to identify all the coins, you can extract the edges of the coins and of their patterns as image features.
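As a rough illustration of edge features, the sketch below applies the Sobel operator, one common edge detector, to a small grayscale image. The talk does not name a specific operator, so Sobel is an assumption here, and the function name is hypothetical.

```python
import math

# Sobel kernels: horizontal and vertical intensity differences.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude at each interior pixel (borders stay 0)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    pixel = image[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * pixel
                    gy += SOBEL_Y[dy + 1][dx + 1] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A dark region on the left and a bright region on the right.
step = [[0, 0, 0, 255, 255] for _ in range(5)]
edges = sobel_magnitude(step)
print(edges[2])  # large values only near the boundary between the regions
```

Pixels whose magnitude exceeds a chosen threshold would then be kept as edge pixels, for example the outline of a coin.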

Feature Design - Haar feature

When the edges of an object are not sharp, the classic approach is the Haar feature. It uses black and white boxes to represent grayscale changes at different positions in the image. The top two rows represent grayscale changes in the vertical and horizontal directions, so they cover only four directions: up, down, left and right. The diagonal patterns in the lower right corner are a further extension of the Haar method that can represent grayscale changes in the 45-degree direction.
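A two-rectangle Haar feature is simply the pixel sum under the white box minus the pixel sum under the black box. The sketch below computes it with an integral image, the usual trick for making box sums cheap. The function names and the choice of which half counts as "white" are assumptions for illustration.

```python
def integral_image(image):
    """Build a summed-area table with one extra row and column of zeros."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def haar_left_right(table, top, left, height, width):
    """Two-rectangle feature: left half (white) minus right half (black)."""
    half = width // 2
    white = rect_sum(table, top, left, height, half)
    black = rect_sum(table, top, left + half, height, half)
    return white - black


soft_edge = [[10, 10, 200, 200], [10, 10, 200, 200]]
table = integral_image(soft_edge)
print(haar_left_right(table, 0, 0, 2, 4))  # strongly negative: right side is brighter
```

A value near zero means the two halves have similar brightness, while a large absolute value indicates a grayscale change across the box.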

Feature design - symmetry

Many objects to be recognized have a certain local symmetry, such as people and houses, so symmetry can be used as a feature. The response is brighter the closer a point is to the object's center of gravity and darker the closer it is to the edge, and the center of gravity is used as a feature to represent the object.

Feature Design - Scale invariant features

In addition to Haar features and symmetry, the scale-invariant feature transform (SIFT) is commonly used in object detection. Scale space describes how something we look at gradually changes from blurry to clear as we move from far to near. SIFT extracts key points across scales in the picture and computes orientation vectors around each of them. These orientation vectors are then used to match photos taken from a different angle or rotated. Even if the image is partly occluded, it can still be matched as long as its key points are not covered.

Feature Design - Histogram of Oriented Gradients (HOG)

Another grayscale-based feature is the histogram of oriented gradients (HOG). The person and the background in the following figure differ in grayscale, which this method can exploit. The green lines in the figure represent the direction in which the grayscale changes least. The person in this picture is wearing black clothes with almost no change in grayscale, so the lines keep extending vertically. The background receives light from all directions, so its gradient histogram is more chaotic. In this way the histogram of oriented gradients can be used to identify people.
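The core of HOG is a per-cell histogram of gradient directions, weighted by gradient strength. The sketch below computes that histogram for one cell. It is a simplified illustration: the 9 unsigned bins over 0 to 180 degrees follow common practice, and block normalization and bin interpolation from the full method are left out.

```python
import math


def cell_orientation_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations for one grayscale cell.

    Each interior pixel votes for the bin of its gradient angle,
    weighted by its gradient magnitude.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Brightness changes from left to right, so every gradient points horizontally.
vertical_edge = [[0, 0, 100, 100] for _ in range(4)]
print(cell_orientation_histogram(vertical_edge))
```

A cell covering a clean edge puts nearly all of its weight in one bin, whereas a cluttered background spreads the weight over many bins, which matches the "more chaotic" histogram described above.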

2. Segmentation and object detection

Watershed algorithm

First, the entire picture is scanned to obtain its grayscale profile, which is treated as a terrain. Then the terrain is flooded with water from its valley bottoms. As the water rises, the water from two adjacent valleys will eventually be about to merge; at that point a dam is built to keep them apart. The flooding continues until two more adjacent basins are about to merge, and another dam is built. Repeating this process eventually leaves multiple dams across the whole terrain, and the location of each dam is a segmentation edge.
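The flooding idea can be sketched on a one-dimensional grayscale profile, which keeps the code short. This is a simplified illustration rather than a full image implementation: it assumes strict local minima as the starting valleys, processes pixels in order of height with a priority queue, and marks a pixel as a dam (label -1) when water from two different valleys would meet there. All names are hypothetical.

```python
import heapq


def watershed_1d(profile):
    """Label each position with its basin number, or -1 where a dam is built."""
    n = len(profile)
    labels = [0] * n          # 0 means "not flooded yet"
    heap = []
    next_label = 1
    # Every strict local minimum is the bottom of a valley.
    for i in range(n):
        left = profile[i - 1] if i > 0 else float("inf")
        right = profile[i + 1] if i < n - 1 else float("inf")
        if profile[i] < left and profile[i] < right:
            labels[i] = next_label
            next_label += 1
            heapq.heappush(heap, (profile[i], i))
    # Raise the water level: always expand from the lowest flooded position.
    while heap:
        _, i = heapq.heappop(heap)
        for j in (i - 1, i + 1):
            if 0 <= j < n and labels[j] == 0:
                basins = {labels[k] for k in (j - 1, j + 1)
                          if 0 <= k < n and labels[k] > 0}
                if len(basins) > 1:
                    labels[j] = -1            # two basins meet: build a dam
                else:
                    labels[j] = labels[i]     # water from one basin spreads
                    heapq.heappush(heap, (profile[j], j))
    return labels


profile = [3, 1, 2, 5, 2, 0, 4]   # two valleys separated by a ridge at index 3
print(watershed_1d(profile))
```

On this profile the ridge at index 3 becomes the dam, splitting the line into two segments, which is the one-dimensional analogue of a segmentation edge.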

Active shape model

Another traditional object detection method is called the active shape model. We extract the outline of the object, deform it in various ways, and then match it against the target to be identified. If it matches, the object has been detected. The limitation of this method is that if the angle of the input picture changes, or the person is dressed differently, the match may fail.

The Explosion of Image Processing - Deep Learning Methods

1. Neural networks for deep learning

Deep learning refers to deep neural networks. On the left is a simple two-layer neural network. It looks as if it has three layers, but the input layer is generally not counted. A neural network has three kinds of layers: the input layer, hidden layers, and the output layer. The input layer receives the inputs from the user; the hidden layers perform the computation; the output layer produces the result. So what is the relationship between neural networks and SVMs and logistic regression? In fact, logistic regression and SVM can be seen as special cases of a single-layer neural network.
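The relationship between logistic regression and a neural network can be made concrete with a small forward pass. In the sketch below, logistic regression is one dense layer with a single sigmoid neuron, and the two-layer network stacks a hidden layer in front of it. The weights are made-up numbers, training is omitted, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def logistic_regression(inputs, weights, bias):
    """Logistic regression is a dense layer with exactly one sigmoid neuron."""
    return dense(inputs, [weights], [bias], sigmoid)[0]


def two_layer_network(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer -> output layer (the input layer is not counted)."""
    hidden = dense(inputs, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, sigmoid)


features = [1.0, 2.0]
print(logistic_regression(features, [0.5, -0.25], 0.0))
print(two_layer_network(features,
                        hidden_w=[[0.1, 0.2], [-0.3, 0.4]], hidden_b=[0.0, 0.0],
                        out_w=[[1.0, -1.0]], out_b=[0.0]))
```

The only structural difference between the two functions is the extra hidden layer, which is what lets the network model relationships that a single layer cannot.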

The figure above is a simple neural network architecture; real neural networks are far more complex. The following figure is a neural network for face detection. The hidden layers handle different variables, and different variables address different aspects of the problem. The results are finally aggregated by the output layer.

In addition to the pyramid shape, neural networks take the other forms shown in the figure below. These are different network structures proposed for different problems. In traditional machine learning, we think about which algorithm to use and how to tune its parameters in order to apply the model to a product. In deep learning, most of the algorithm scientists' research is about which kind of neural network to use in the product, how to design the role of each layer, how to design the activation functions, how to design the output layer, and how to aggregate results. The way of thinking has changed a lot.

Convolutional neural networks

In the CV field, the most common deep neural network is the convolutional neural network (CNN). First comes the convolutional layer. The feature design that is done by hand in traditional machine learning is carried out here by the convolutional layers. The convolutional layer is followed by the pooling layer. The pooling layer picks out the important features, or merges several unimportant features, before passing them on, which reduces the amount of computation. The last layer is called the fully connected layer, and its role is to aggregate all the preceding data to produce the result.
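The three layer types described above can be chained in a tiny forward pass. This is a minimal sketch with a hand-picked kernel and weights instead of learned ones; the ReLU activation between convolution and pooling is a common choice added here as an assumption, and as in most deep learning libraries the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(feature_map):
    """Keep positive responses and zero out negative ones."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the strongest response in each non-overlapping 2x2 window."""
    return [[max(feature_map[y][x], feature_map[y][x + 1],
                 feature_map[y + 1][x], feature_map[y + 1][x + 1])
             for x in range(0, len(feature_map[0]) - 1, 2)]
            for y in range(0, len(feature_map) - 1, 2)]


def fully_connected(feature_map, weights, bias):
    """Flatten the feature map and aggregate it into one output score."""
    flat = [v for row in feature_map for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


image = [[0, 0, 1, 1, 1] for _ in range(5)]   # dark left, bright right
edge_kernel = [[-1, 1], [-1, 1]]              # responds to left-to-right brightening

features = relu(conv2d_valid(image, edge_kernel))   # 4x4 feature map
pooled = max_pool_2x2(features)                     # 2x2 after pooling
score = fully_connected(pooled, [1, 1, 1, 1], 0.5)
print(features, pooled, score)
```

Pooling shrinks the 4x4 feature map to 2x2 while keeping the strong edge response, which is the data reduction the text describes.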

In practice, new architectures keep evolving, such as the Faster R-CNN in the figure below. It includes many optimizations, the most important of which is the addition of an RPN (region proposal network). The original approach searches the entire image, so for a very large image the amount of computation is very large and the speed is very slow. To speed things up, the regions of the picture that contain no target are excluded first, and the actual R-CNN is run only on the remaining regions, which makes the whole process much faster.

From CNN to Faster R-CNN, the way of thinking about the problem is the same: object detection is solved as classification. YOLO is a newer network that instead treats detection as a regression problem.

2. Image AI application case study

Case 1

The May Fourth Youth Day campaign shown earlier works in exactly this way. First, several hundred photos of aspiring young people from the Republic of China period were collected, their features were extracted and quantified, and each person was labeled. After training, a model is generated. In actual use, the backend extracts the feature values of the uploaded photo, matches them against those in the model, and returns a classification plus a confidence level. Finally, a result page is composed for users to share in WeChat Moments.
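The matching step, returning a label plus a confidence, can be sketched as a nearest-neighbor lookup over feature vectors. This is a simplification swapped in for illustration, not the trained model used in the campaign: the three-number feature vectors, the labels, and the use of cosine similarity as the confidence score are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two feature vectors, ignoring their overall length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, gallery):
    """Return the gallery label most similar to the query, with its score."""
    scores = {label: cosine_similarity(query, features)
              for label, features in gallery.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical labeled features extracted from the historical portraits.
gallery = {
    "portrait_a": [1.0, 0.0, 0.0],
    "portrait_b": [0.0, 1.0, 0.0],
}
uploaded_photo_features = [0.9, 0.1, 0.0]   # hypothetical features of the upload

label, confidence = best_match(uploaded_photo_features, gallery)
print(label, round(confidence, 3))
```

Because the upload never matches a stored portrait exactly, the score stays below 1, which is the fuzzy matching behavior the service relies on.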

Case 2

The second is another particularly popular example, face fusion. The military uniform photos and ancient costume photos that many people have seen are the result of face fusion algorithms. The principle is as follows. First, the user uploads a photo, but the uploaded photo is often not perfectly posed: the face may be turned or tilted. To make the fusion result smoother, facial key points are located and the face is aligned and corrected. An algorithm then cuts out the face and fuses it with the template image. The raw fusion result does not look very natural, so most of the remaining work is image correction, such as curve adjustment, edge blending, and color adjustment. In this way, you can see a photo of yourself as if you had gone back to the Republic of China or the Qing Dynasty.
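Edge blending, one of the correction steps mentioned above, can be illustrated with a feathered mask: the face is fully opaque in the middle and fades into the template toward its border. The sketch works on a single row of grayscale pixels to stay short. The linear ramp and the function names are assumptions for illustration, not the fusion algorithm used in the product.

```python
def feather_mask(width, ramp):
    """Weights that are 1.0 in the middle and fall linearly to 0.0 at both ends.

    `ramp` is the number of pixels over which the fade happens and must be > 0.
    """
    mask = []
    for x in range(width):
        distance_to_edge = min(x, width - 1 - x)
        mask.append(min(1.0, distance_to_edge / ramp))
    return mask


def blend_row(face, template, mask):
    """Per-pixel weighted average: mask * face + (1 - mask) * template."""
    return [round(m * f + (1.0 - m) * t)
            for f, t, m in zip(face, template, mask)]


face_row = [200, 200, 200, 200, 200]       # pixels cut from the uploaded face
template_row = [100, 100, 100, 100, 100]   # pixels of the costume template
mask = feather_mask(len(face_row), ramp=2)
print(mask)
print(blend_row(face_row, template_row, mask))
```

Without the fade, the pasted face would show a hard seam against the template; with it, the brightness changes gradually across the boundary.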

Case 3

The third example is storytelling based on pictures. This engine combines several different algorithms. The whole training process is basically unsupervised, except for training on the text corpus used for storytelling. It suits a wide range of scenarios. By swapping the text corpus, for example from romance to science fiction, the generated text changes from romantic to science fiction in style, so it is very flexible.

3. Frontier analysis of computer vision algorithms

Here is a paper repository you should not miss: www.arxiv.org

As mentioned earlier, there are two ways to achieve driverless driving: one uses lidar, and the other uses monocular or binocular camera images. Baidu and Google mainly use lidar. Tesla uses camera images, mainly for cost reasons, and so it collects a lot of data.

Both approaches are making progress. Through continuous optimization, Tesla's driving capability has now reached L3 and is slowly approaching that of a car with lidar. As the amount of data grows, its performance may get ever closer to that of lidar. What is the advantage of lidar? Because the lidar sits on the roof of the car, its scan yields a bird's-eye view in the form of a 3D map. It obtains the position of every surrounding object relative to the car, as well as each object's shape, and can even build a 3D model.

An ordinary monocular photo is just a flat 2D image with no distance information. A recent algorithm called the orthographic feature transform (OFT) first performs orthographic feature extraction on the different objects in a static picture and then uses the result to identify the relative positions of the objects. Through a series of calculations, the monocular image is turned into a 3D representation, which basically achieves the effect of lidar. Among the methods proposed for converting monocular images to 3D, OFT works best.

Analysis of the supporting cloud AI capabilities

When we build a product, an algorithm model alone is not enough; we also need a very robust architecture to support it, so that tens or hundreds of millions of users can use the model stably. For this we need a cloud architecture. The following figure shows our cloud service architecture.

The following figure is the solution matrix of Tencent Cloud. It is divided into several areas. The face (computer vision) area includes face fusion, ID recognition, and a variety of scenario-based solutions, such as smart access control. In the speech area, we also have ASR (speech to text) and TTS (text to speech) capabilities. The layer underneath is supported by a machine learning platform and a big data platform. The infrastructure includes CPU, GPU, and FPGA servers.

Our current computer vision products fall into four categories. Smart Vision is a real-name identity verification product. Shendi covers face recognition in multiple scenarios, such as attendance check-in and face recognition for payments. Mingshi is structured image analysis, which includes ID card recognition, bank card recognition, and business card recognition. Magic Mirror is mainly for content moderation, such as identifying sensitive information in videos and pictures.

Below is TIMatrix, one of our video management platforms for private deployment. It is a product for smart buildings and campuses. It can help a theme park, a factory or a company quickly set up a complete video surveillance system. Behind it are a variety of AI engines for big data analysis, customer profiling, heat maps and so on, which makes it well suited to B2B scenarios.

Skill advancement

That concludes today's talk. Thank you.

To read more in-depth technical articles, follow the WeChat public account "DataFunTalk".

About the Speaker:

Ye Cong is a Tencent technology expert and former artificial intelligence technology manager at Amazon AWS. With years of experience in cloud computing R&D, he has led the architecture design and development of multiple products with millions of users, and has rich experience in multinational teams and project management. He graduated from the Department of Electrical Engineering and Computer Science at the University of Minnesota and pursued a postgraduate certificate in AI at Stanford University.

About Us:

DataFun focuses on the sharing and exchange of big data and artificial intelligence technology applications. Since its founding in 2017, it has held more than 100 offline and 100 online salons, forums and summits in Beijing, Shanghai, Shenzhen, Hangzhou and other cities, and has invited more than 2,000 experts and scholars to share. Its public account DataFunTalk has produced over 700 original articles with millions of reads and more than 140,000 targeted followers.
