
IEEE Fellow Mei Tao: Frontiers and Challenges in Visual Computing


Creating intelligent machines and moving toward general AI has long been a dream of humanity. What stage has AI development reached today?

Author | Victor

Edit | Twilight

On December 9 this year, the 6th Global Conference on Artificial Intelligence and Robotics (GAIR 2021) officially opened in Shenzhen. More than 140 industry and academic leaders and 30 Fellows gathered to examine AI from the dimensions of technology, products, industry, humanities, and organizations, combining rational analysis with perceptive insight to survey the wave of artificial intelligence and digitalization.

On the second day of the conference, Mei Tao, IEEE/IAPR Fellow, Vice President of JD Group, and Vice President of JD Explore Academy, delivered a report at GAIR titled "Visual Computing from Perceptual Intelligence to Cognitive Intelligence." He pointed out that although perceptual research in visual computing is relatively mature, and some artificial intelligence (AI) tasks such as content synthesis and image recognition can already pass Turing tests, video analysis still faces many challenging problems, owing to the diversity of video content and the ambiguity of video semantics.

At the same time, visual computing has made some progress on the cognitive side: datasets such as Visual Genome and VCR have laid the groundwork for structural knowledge modeling, and at the reasoning level, domestic scholars have tried to achieve deep understanding of scenes and events through joint parsing and cognitive reasoning.

The following is the full text of the speech, edited by AI Technology Review without changing the original meaning:


Today's talk is titled "Visual Computing from Perceptual Intelligence to Cognitive Intelligence." Before we begin, let me use two Turing-test examples to illustrate the progress of AI.

First, computer vision has reached the level of passing the Turing test not only in recognition but also in content synthesis. As shown in the image above, it is already difficult for humans to pick out the two machine-synthesized pictures from a set of pictures.


Another Turing-test example is image captioning: given a picture, describe its content. The following two sentences were generated by a human (the first) and a machine (the second), respectively. Clearly, without looking closely at the picture, you might subconsciously think the machine's description is more detailed than the human's.

1. a dog is lifted among the flowers

2. a dog wearing a hat sitting within a bunch of yellow flowers

If you look closely at the picture, you will find there is indeed a hand lifting the puppy up. This shows that machines have difficulty describing phenomena that occur infrequently; the reasons lie both in what the machine has learned and in the machine's lack of logical reasoning ability.

From these two examples we can see that in perception, AI has surpassed humans; in cognition, it still falls short.

1

Advances and challenges in computer vision


The image above shows the progress computer vision has made over the past fifty or sixty years. Before deep learning caught fire in 2012, computers usually completed vision tasks in two steps: feature engineering and model learning.

Feature engineering relied entirely on human ingenuity, producing hand-designed features such as the Canny edge detector, Snakes (active contours), and Eigenfaces. These methods have been cited heavily: Canny about 38,000 times, Snakes 18,000 times, and SIFT more than 64,000 times.

After 2012, the rise of deep learning disrupted almost all computer vision tasks. Its hallmark is the unification of traditional feature engineering and model learning: feature design is carried out as part of the learning process.

Another sign of deep learning's popularity is the huge number of papers submitted each year to the top computer vision conferences (CVPR, ICCV, ECCV, etc.). Methods that perform outstandingly attract enormous attention: GoogleNet and VGG have each accumulated around 100,000 citations in less than eight years, and ResNet (2015) reached nearly 100,000 citations in even less time.

This shows that deep learning is developing rapidly and that more and more people are entering the field. On the one hand, deep learning networks are constantly being updated; on the other, image and video datasets keep growing, with some now exceeding 100 million samples.

One trend in deep learning is "crossing borders." In 2019, the Transformer proved its strength in natural language processing, and a large number of scholars have since studied how to bring it into vision; for example, Microsoft Research Asia's Swin Transformer work won the ICCV Best Paper Award.
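The core idea behind such window-based vision Transformers is easy to sketch: the feature map is split into small non-overlapping windows, and self-attention is computed only within each window. Below is a minimal NumPy sketch of the window-partition step (the function name and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping m x m windows.

    Self-attention is then computed independently inside each window, so
    its cost grows linearly with image size rather than quadratically.
    """
    h, w, c = x.shape
    assert h % m == 0 and w % m == 0, "H and W must be divisible by m"
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m * m, c)

feat = np.zeros((56, 56, 96))            # illustrative early-stage feature map
print(window_partition(feat, 7).shape)   # (64, 49, 96): 64 windows of 49 tokens
```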


The figure above shows how datasets have evolved as the research paradigm changed. Both the number of categories and the size of datasets keep increasing, with some datasets exceeding one billion samples. In video, for example, the UCF101 dataset covers 101 action classes. At the same time, this scale brings a drawback: some universities and small laboratories cannot afford to train models on such data.


What does progress look like in a specific area? In image recognition, the best-known benchmark is the ImageNet competition: given an image, predict five candidate labels. As deep networks grew deeper, the recognition error rate kept falling; by 2015, ResNet had reached 152 layers and surpassed human performance at image recognition.
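The metric behind this benchmark is top-5 error: a prediction counts as correct if the true label appears among the model's five highest-scoring classes. A minimal sketch with made-up scores (the helper name is mine, not from the competition toolkit):

```python
def top5_correct(scores, true_label):
    """Return True if true_label is among the 5 highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:5]

# Toy scores over 7 classes; real ImageNet models output 1000 scores.
scores = [0.02, 0.10, 0.05, 0.40, 0.15, 0.08, 0.20]
print(top5_correct(scores, 6))  # True: class 6 has the 2nd-highest score
print(top5_correct(scores, 0))  # False: class 0 has the lowest score
```

The top-5 error rate over a test set is simply the fraction of images for which this check fails.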


In video analysis, the Kinetics-400 benchmark reflects the field's progress. From 2017 to 2019, a variety of neural networks suited to video tasks appeared, but their sizes and depths vary widely, and their recognition accuracies are not consistent either. In other words, the field still has many open questions. As for the reasons, I personally see two:

1. Video content is extremely diverse, and video is spatio-temporally continuous data.

2. The same semantics can carry different meanings in video, for example, the same words spoken in different tones and with different expressions.


Over the past 10 to 20 years, visual perception has had many research themes. As shown in the figure above, from the pixel level up to the video level, they can basically be grouped into several major areas: semantic segmentation, object detection, video action and behavior recognition, image classification, and vision-and-language. Among them, vision-and-language has been especially hot in the past five years: it requires not only generating text descriptions from images and videos, but also, in reverse, generating video or image content from text descriptions.

To sum up, the main focus of current vision research is still RGB video and images. In the near future, imaging methods will change; the data we study will then not only be 2D but will transition to 3D, and even to richer multimodal data.

In visual understanding, generic recognition is very simple: distinguishing cats from dogs, or cars from people. But to truly understand the natural world, very fine-grained image recognition is needed. An intuitive example is bird identification: an ideal machine would need to recognize 100,000 kinds of birds to meet humanity's requirement of "understanding the world." Going finer still, recognition must reach the granularity of product SKUs.

Note: 200 ml and 300 ml bottles of the same mineral water are different SKUs, i.e., different granularities of the same product.

Over the past few years, JD.com has done some exploration in this area, combining detection with attention and using self-supervision. The papers include "Destruction and Construction Learning" (CVPR 2019) and self-supervised structure modeling work (CVPR 2020).

CVPR 2019: Destruction and Construction Learning for Fine-Grained Image Recognition

Paper address: https://openaccess.thecvf.com/content_CVPR_2019/papers/Chen_Destruction_and_Construction_Learning_for_Fine-Grained_Image_Recognition_CVPR_2019_paper.pdf

CVPR 2020: Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Paper address: https://arxiv.org/abs/2003.14142

The video field is very challenging. I wanted to borrow from ResNet, which was a highly innovative network in image recognition thanks to its skip connections. So, at the time, the idea was to apply the 2D CNN design directly to the 3D domain.

Related attempts had in fact been made, but with certain difficulties. Facebook, for example, found that convolving along all three axes (two spatial axes plus time) makes the parameter count explode, which made it hard to improve model performance. So in 2015, Facebook designed only an 11-layer 3D convolutional network.


My attempt was to design 3D convolutions on top of ResNet, but I hit the same difficulty as Facebook: parameter explosion. So, in a CVPR 2017 work, I used a 1×3×3 two-dimensional spatial convolution plus a 3×1×1 one-dimensional temporal convolution to approximate the commonly used 3×3×3 3D convolution.

With this simplification, a network of the same depth adds only a certain number of 1D convolutions over its 2D counterpart, so the parameter count and running speed do not grow excessively. Moreover, because the 2D convolution kernels can be pre-trained on image data, the need for labeled video data is greatly reduced. The paper has now been cited more than 1,000 times and is well recognized by the community.

CVPR 2017: Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Paper address: https://arxiv.org/abs/1711.10305
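The parameter savings of this factorization are easy to verify by counting weights. The sketch below uses an illustrative 64-channel layer (the channel width is my assumption, not a figure from the paper): the 1×3×3 plus 3×1×1 pair needs (9+3)/27 ≈ 44% of the weights of a full 3×3×3 kernel.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a kt x kh x kw convolution (bias terms omitted)."""
    return c_in * c_out * kt * kh * kw

c = 64  # illustrative channel width
full = conv3d_params(c, c, 3, 3, 3)                                # full 3x3x3
p3d = conv3d_params(c, c, 1, 3, 3) + conv3d_params(c, c, 3, 1, 1)  # 1x3x3 + 3x1x1
print(full, p3d, round(p3d / full, 2))  # 110592 49152 0.44
```

The ratio 12/27 holds for any channel width, which is why the factorized network stays close to a 2D network in size even as it deepens.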


Many problems remain open in other research areas as well. In 3D vision, for example, not only semantic segmentation but also estimation of object pose is required; in image-to-language research, it is necessary not only to generate a description for a given picture but also to capture the spatial and semantic relationships between objects.

2

Application of visual perception


AI has long been seen as changing the paradigm of industry. In 2019, PwC released a report estimating that AI will boost the global economy by 14% by 2030; for China, the potential uplift is 26%.

Applying AI in industry basically requires meeting at least one of three conditions: reducing costs, improving efficiency, or improving user experience. Trillion-dollar companies such as Microsoft and Apple share a common trait: they roll out AI technologies comprehensively, at scale, and in one push.

When AI is promoted at scale, many interesting applications are born, such as "shopping by photo," whose core technology is photo-to-search. The field has been cultivated for many years, but the scenario where it really pays off is e-commerce. Taking JD.com as an example, its photo-shopping accuracy has improved greatly compared with four years ago, and the user conversion rate has increased more than tenfold.

Another example in e-commerce retail is "smart outfit matching," which aims not only to have AI recommend the same product but also to offer clothing suggestions. For example, when a user buys a top, the AI automatically matches a skirt or a pair of shoes and generates a description telling the user why the pieces go well together. After launch, the feature's click-through rate exceeded that of manual matching.


Intelligent broadcast directing is another application AI is good at. In a football match, many fixed cameras feed video to an outside-broadcast (OB) truck, where 20 to 30 staff continuously cut the footage into a single broadcast stream that everyone watches. Intelligent directing means using AI to learn how human directors work and then outputting content tailored to each user's preferences: fans of football get highlight shots and key actions, while fans of the stars get player close-ups, achieving a personalized stream for every viewer.


Intelligent directing involves a wide range of technologies: action/event recognition, face recognition, pose estimation, highlight detection, camera view switching, and so on. It is worth mentioning that twenty years ago, when I interned at Microsoft, my mentor assigned me a similar task, but limited data and compute meant it did not achieve good results. We only launched the feature on JD.com two years ago.

The metaverse concept is hot, and JD.com has also made some attempts with digital humans. Recently, its cross-modal analysis technology and multimodal interactive digital human technology won the Best Demo Award at the top conference ACM Multimedia.

Traditional digital humans could only interact through text, while today's digital humans aim to hold conversations like real people, characterized by lifelike appearance, realism, and real-time responsiveness. Digital human technology has already been deployed successfully in a mayor's hotline.

3

Towards general-purpose AI

General AI has always been a human dream. On the way there, vision must transition from perception to cognition, so that intelligent visual systems can make decisions.


But there are many challenges. Take robustness: in autonomous driving, car collisions and recognition errors show that systems are not yet robust enough. Model and data bias is also a frequent focus of academic debate; some time ago, the AI luminary Yann LeCun left Twitter after an online storm over his statement about whether bias comes from the data or from the model.


There are two main differences between cognitive intelligence and perceptual intelligence. At the goal level, traditional AI aims to augment human thinking and provide accurate results, while cognitive AI aims to imitate human behavior and reasoning. At the capability level, traditional AI seeks to find learning patterns or reveal hidden information, while cognitive AI seeks to model human thought to find solutions. Clearly, cognitive AI will have many future uses, such as trustworthy systems and model interpretation.


To achieve cognitive AI, three core problems must be solved: first, how to model structural knowledge; second, how to make models interpretable; third, how to give systems the ability to reason.

For structural knowledge modeling, academia has made some attempts, such as the Visual Genome dataset developed by Fei-Fei Li's group at Stanford University and the VCR dataset released by the University of Washington.


What about progress in reasoning? Professor Song-Chun Zhu of the Beijing Institute for General Artificial Intelligence recently published a paper in the Chinese Academy of Engineering journal Engineering, arguing that, by decomposing even a simple picture, a computer vision system should be able to perform the following tasks simultaneously:

1. reconstruct the 3D scene and estimate camera parameters, materials, and lighting conditions;

2. parse the scene hierarchically in terms of attributes, fluents, and relationships;

3. reason about the intentions and beliefs of agents (such as the man and dog in this example);

4. predict their behavior over time;

5. recover invisible elements, such as water and the states of unobservable objects.

Paper title: Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense

Paper address: https://arxiv.org/abs/2004.09044


Finally, I would like to conclude with a Gartner hype-cycle chart. Every technology passes through several stages: emergence, inflated expectations, the bursting of the bubble and the trough, and then the return of rationality. As shown above, interpretability and trustworthiness in general AI are still in the climbing stage, while computer vision has reached the end of the fourth stage. This means that within the next two or three years, computer vision will reach technological maturity and be commercialized at scale, benefiting human life.

Leifeng Network
