
Only one-tenth of the data is needed to handle four major vision tasks, and it is now open source!

Have you ever run into this kind of trouble?

Every time you move house, you face the same furniture problem: the expensive, heavy pieces are hard to carry, but it also feels wrong to leave them all behind.

Buying new furniture for the next place wastes money, and worse, you end up doing the same thing over and over. The furniture barely gets used before it is replaced, so its utilization is low.

The pain of moving is a lot like a pain point in AI: handling just a few tasks requires developing several highly customized models. Not only is a large amount of data needed, but it has to be annotated from scratch every time, and the cost of data acquisition and labeling quickly becomes unaffordable.

Frontier AI research alone consumes that much effort, to say nothing of the tens of thousands of long-tail tasks found in real application scenarios.

So what can be done?

Building a general deep learning model is the key.

Generality is the foundation of the technology

In China and abroad, researchers working on foundational technology share the mission of designing a "general model". The two main battlefields for building one are the two most widely applied directions of deep learning: language and vision.

General language models (GLM) have made remarkable progress: BERT, T5 and GPT-3 are already well equipped to handle a wide range of downstream language tasks.

In contrast, research on general vision models (GVM) has been slow to deliver a satisfactory answer.

Most previous GVM work relies on a single source of supervision: ViT-G/14 uses labeled supervision, SEER uses contrastive learning between different augmentations of the same sample, and CLIP uses image-text pairs as supervision. Pre-trained under a single supervision signal, these paradigms do produce models that perform well in fixed scenarios. However, when applied downstream to diverse scenarios and diverse tasks, these models struggle.

Take autonomous driving, currently one of the hottest applications. A moving car has to watch road conditions, traffic lights and pedestrians all at once; with the rise of the intelligent cockpit it must also integrate language technology and LBS (location-based) scene services. With so much perception data, so many collaborative tasks and so many new tasks appearing at random, the demands placed on vision models have risen sharply in both volume and dimension.

This is exactly when a general vision model is needed: it lowers the R&D threshold, especially the time and capital costs borne by academia, so that downstream scenarios can deliver the best possible experience.

In November last year, Shanghai Artificial Intelligence Laboratory, together with SenseTime, the Chinese University of Hong Kong and Shanghai Jiao Tong University, released the general vision technology system "Shusheng" (INTERN), a continual learning framework for systematically addressing a series of bottlenecks in today's computer vision, such as task generalization, scene generalization and data efficiency.

Not long ago, Shanghai Artificial Intelligence Laboratory and SenseTime released the open-source platform OpenGVLab, which opens up its highly efficient pre-trained models, an ultra-large-scale public dataset, and the industry's first evaluation benchmark for general vision models.

What is the magic of these open source technologies?

"Brute force works miracles": creating a general vision model

"Intern" is the underlying technology for practicing general visual skills.

In terms of technical implementation, the "Shusheng" system consists of seven modules: three infrastructure modules and four training stages.

The three infrastructure modules are the General Vision Data system (GV-D), the General Vision Architecture (GV-A), and the General Vision Benchmark (GV-B).

The four training stages are: upstream foundation model training (Amateur), upstream expert model training (Expert), upstream generalist model training (Generalist), and downstream adaptation training (Downstream-Adaptation).

Intern Structure Diagram

First, the General Vision Data system (GV-D).

This is an ultra-large-scale, finely annotated dataset with 10 billion samples and multiple types of supervision signals. Four subsets are defined according to the four major vision tasks: the multimodal data GV-D-10B, GV-Dc-36M with classification labels, GV-Dd-3M with detection labels, and GV-Ds-143K with segmentation labels.

In addition, the dataset comes with a label system of 119,000 labels, which not only covers many natural domains and almost all labels used in current computer vision research, but also adds a large number of fine-grained labels covering attributes, states and more across various types of images.

And this is a big part of how "Shusheng" makes brute force work miracles.

Second, the General Vision Architecture (GV-A).

It is built from a unified search space containing both CNN and Transformer operators.

Why build such a hybrid structure? For years, convolutional neural networks (CNNs) have dominated visual representation learning and demonstrated stable transferability in downstream tasks such as image classification, object detection, and semantic segmentation. In recent years, however, the Vision Transformer (ViT), using only a plain Transformer structure as the image encoder, has been able to match CNN performance on ImageNet-1k, and ViT has shown greater potential than CNNs on large-scale datasets.

Despite ViT's performance advantages, pure Transformer networks lack certain inductive biases compared with convolutional networks, so they require more data and compute. In addition, the computational cost of self-attention grows quadratically with the number of input tokens, limiting its use at high input resolutions. Therefore, combining CNN, Transformer and MLP to balance efficiency and effectiveness is the key to the model's generality.

This model structure, which combines stronger generalization with higher model capacity, is called MetaNet. The network structure is then searched within the MetaNet family to obtain an optimal architecture for model training.

MetaNet architecture for unified search: Conv and Trans denote convolution and Transformer, respectively; C and S are the number of output channels and the stride of each stage.

Specifically, MetaNet not only adopts a unified search framework based on the PPO reinforcement learning algorithm, but also, to keep traditional downsampling modules from becoming a performance bottleneck, replaces them with context-aware down-sampling modules (DSM), including the local-global DSM (LG-DSM) and the global DSM (G-DSM).

Thus, at shallow layers the model still uses convolution to extract features, while at deep layers it can use Transformer modules combined with the LG-DSM to better capture global information.
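
To make the "shallow convolution, deep Transformer" idea concrete, here is a minimal PyTorch sketch of such a hybrid backbone. This is not the official MetaNet code; the module layout, channel sizes and stage split are illustrative assumptions.

```python
# A minimal sketch of a hybrid CNN+Transformer backbone (NOT the official MetaNet):
# shallow stages use convolutions for local features, a deep stage uses
# self-attention for global context.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Shallow stage: plain convolution extracts local features."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class TransformerStage(nn.Module):
    """Deep stage: flatten the feature map into tokens and apply self-attention."""
    def __init__(self, dim, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        tokens = self.encoder(tokens)                 # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class HybridBackbone(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stage1 = ConvStage(3, 64)       # shallow: convolution
        self.stage2 = ConvStage(64, 128)
        self.stage3 = ConvStage(128, 256)
        self.stage4 = TransformerStage(256)  # deep: Transformer for global context
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.stage3(self.stage2(self.stage1(x)))
        x = self.stage4(x)
        return self.head(x.mean(dim=(2, 3)))  # global average pooling

if __name__ == "__main__":
    model = HybridBackbone()
    print(model(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 1000])
```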

At the same time, based on the largest model, MetaNet-B15, Shusheng has distilled 13 different model structures with a total of 24 sets of model weights, all of which have now been open sourced.

These structures cover most mainstream backbones on the market today. They can easily be migrated into the desired algorithm framework as pre-trained initializations for new networks, and they reach better results than the originals with shorter training time.

Comparing the MetaNet models with other model structures gives the following results:

Structures based on convolution, Transformer, and a mixture of the two are denoted by C, T, and H, respectively

In terms of image classification performance, the MN-B1, MN-B4 and MN-B7 models of the MetaNet series not only achieve higher accuracy, but also have lower FLOPs and parameter counts than other SOTA models.

Beyond classification, MetaNet was also used as the backbone for detection and segmentation with the Mask R-CNN framework on the COCO dataset, and it was found that:

With fewer model parameters, MN-B4 is 2 to 4 points more accurate than Swin-T. On the ADE20K semantic segmentation benchmark, the mIoU of MN-B4 is as much as 5 points higher than that of Swin-T.

These two sets of results show that the MetaNet family reaches a new SOTA trade-off between model accuracy and computational cost.

Finally, the General Vision Benchmark (GV-B).

The GV-B evaluation benchmark acts like a competition arena.

As shown in the table below, the benchmark collects 26 downstream task datasets for 4 types of visual tasks: classification, detection, segmentation, and depth estimation.

In terms of setup, the benchmark introduces a percentage-shot setting: only a fraction of the full dataset, such as 10% or 20%, is selected, so that model performance can be compared after the amount of downstream training data is reduced.

Compared with the traditional few-shot setting, this percentage-shot setting preserves properties such as the long-tail distribution of the original dataset and reduces sensitivity to sample selection. For datasets with imbalanced class distributions, such as VOC07+12 in the table below, the percentage subsets inherit that distribution.
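
The distinction is easy to see in code. Below is a minimal sketch of a percentage-shot split, assuming a simple list of (sample, label) pairs; the function name and toy data are illustrative, not part of the GV-B tooling. Sampling a fixed fraction per class keeps the original (possibly long-tailed) class ratio, unlike k-shot sampling, which flattens it.

```python
# Minimal percentage-shot split sketch (illustrative, not the official GV-B code).
import random
from collections import defaultdict

def percentage_shot_split(samples, fraction=0.10, seed=0):
    """samples: list of (item, label); returns a subset with ~`fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))

    subset = []
    for label, items in by_class.items():
        k = max(1, round(len(items) * fraction))  # keep at least one sample per class
        subset.extend(rng.sample(items, k))
    return subset

# Example: a long-tailed toy dataset keeps its imbalance in the 10% subset.
toy = [("img", "cat")] * 1000 + [("img", "lynx")] * 20
print(len(percentage_shot_split(toy, 0.10)))  # ~102 samples, still roughly 50:1 imbalanced
```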

The three rightmost columns (avg, min, and max) give the mean, minimum, and maximum number of samples per class in the 10% subset, respectively

Given these datasets and task types, the paper selects several representative models for evaluation and comparison. For fairness, the comparison uses the official pre-trained weights of these models. They include:

ResNet

CLIP

ResNeXt

BiT

ViT

SwAV, DeepClusterV2 and MoCo v2

DetCo

With the ultra-large finely annotated dataset, the model structures and the evaluation benchmark in place, everything is ready; all that remains is the training.

"Shusheng" (literally "scholar") is the classic image of the scholar in ancient China, a personified figure who acquires all kinds of talents through continuous learning and growth: starting from basic knowledge and skills, moving on to a variety of specialized fields, and finally maturing into a generalist with broad knowledge. Following this image, the INTERN system gradually unifies the general vision field through continual learning and ultimately achieves flexible and efficient model deployment.

Let's look at how the system is trained, step by step, from novice to expert to generalist, until it can finally be applied to all kinds of tasks.

The first stage trains fundamental ability; its product is called the "Amateur" model.

In recent years, CLIP has attracted a lot of attention for its zero-shot recognition capability and its transferability to downstream tasks.

However, CLIP requires 400M image-text pairs for pre-training; constrained by this huge data requirement, CLIP is difficult to push further. Shusheng therefore proposes a new training paradigm, DeCLIP (Data-efficient CLIP), which pre-trains the model using image-text, image-image, and text-text pairs simultaneously, achieving versatility more data-efficiently.
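
As a rough illustration of combining these three supervision signals, here is a minimal sketch (not the official DeCLIP code): one InfoNCE term for cross-modal image-text pairs, plus intra-modal terms for two image augmentations and two text views. The encoder placeholders and loss weights are assumptions for illustration only.

```python
# Minimal sketch of multi-source contrastive supervision (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of matching embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def multi_pair_loss(image_encoder, text_encoder, img_v1, img_v2, txt_v1, txt_v2):
    """img_v1/img_v2: two augmented views of the same images;
       txt_v1/txt_v2: two views of the same captions (e.g. original and reworded)."""
    zi1, zi2 = image_encoder(img_v1), image_encoder(img_v2)
    zt1, zt2 = text_encoder(txt_v1), text_encoder(txt_v2)

    loss_it = info_nce(zi1, zt1)   # cross-modal: image-text pairs (CLIP-style)
    loss_ii = info_nce(zi1, zi2)   # intra-modal: two augmentations of each image
    loss_tt = info_nce(zt1, zt2)   # intra-modal: two views of each caption
    return loss_it + 0.5 * loss_ii + 0.5 * loss_tt  # weights are assumptions

# Toy usage: linear "encoders" standing in for real image/text encoders.
img_enc, txt_enc = nn.Linear(512, 256), nn.Linear(128, 256)
loss = multi_pair_loss(img_enc, txt_enc,
                       torch.randn(8, 512), torch.randn(8, 512),
                       torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```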

In addition, to make full use of large-scale multimodal data when building the foundation model, this stage proposes the Upstream-Amateur (Up-A) vision-language pre-training framework, which mines intra-modal and cross-modal knowledge simultaneously.

This training framework is divided into two pre-training phases: Upstream-Amateur for Global Representation (Up-A-G) and Upstream-Amateur for Local Representation (Up-A-L).

Up-A-G (left) uses group supervision to learn from richer supervision signals, while Up-A-L (right) uses a local self-supervised learning method to adapt the trained vision-language model and improve its performance on dense-prediction CV tasks.

The Upstream-Amateur framework

Thanks to these built-in supervision signals, DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet. That is 0.8% higher than CLIP-ResNet50 while using 81% less data. When transferred to downstream tasks, DeCLIP-ResNet50 outperforms CLIP on 8 of 11 visual datasets.

More importantly, the trained Upstream-Amateur model provides a high starting point for the subsequent training stages.

The second stage trains specialized ability; its product is called the "Expert" model.

The foundation model obtained in the Up-A stage performs well on general visual recognition. But to fully master more specific tasks such as detection and segmentation, more specialized pre-training is still needed for each task, which is what the second stage, the expert models, provides.

For each expert, Shusheng adopts a simple multi-head design: each head is a dataset-specific sub-network branching off a common, shared "backbone". For example, Up-E (C), Up-E (D) and Up-E (S) are the experts for image classification, object detection and semantic segmentation, respectively.
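
A minimal sketch of this shared-backbone, multi-head layout is shown below for a classification expert; the class names and toy backbone are illustrative assumptions, not the official Up-E code.

```python
# Minimal shared-backbone, multi-head sketch (illustrative, not official Up-E code).
import torch
import torch.nn as nn

class MultiHeadExpert(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes_per_dataset):
        super().__init__()
        self.backbone = backbone  # shared across all datasets of this task
        # one lightweight head (sub-network) per dataset
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n_cls)
            for name, n_cls in num_classes_per_dataset.items()
        })

    def forward(self, x, dataset_name):
        feats = self.backbone(x)
        return self.heads[dataset_name](feats)

# Usage: a toy backbone shared by two classification datasets.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
expert = MultiHeadExpert(backbone, 256, {"imagenet": 1000, "cifar": 100})
logits = expert(torch.randn(4, 3, 32, 32), "cifar")
print(logits.shape)  # torch.Size([4, 100])
```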

The third stage trains combined abilities; its product is called the "Generalist" model.

The multitasking described above covers either one visual problem (such as classification) across different datasets (such as ImageNet and CIFAR), or multiple visual problems (such as classification and detection) on one dataset. The key question is how to integrate the experts into a unified model and obtain a more general vision model. Therefore, after the "expert" pre-training stage, a "generalist" stage is used as the third pre-training stage to further unify the feature representations.

Shusheng proposes a new paradigm called "mixed parameter sharing" to develop the generalist model, dubbed the "all-rounder".

Specifically, since the knowledge captured by the experts is interrelated, when expert features are fused into a shared representation, cross-task knowledge transfer based on soft sharing and universal representation learning based on hard sharing are used to pass information between experts (feature transfer) without introducing task conflicts, thereby further improving the multi-task performance of the model, that is, its "generalist" ability.

Structurally, the generalist model is an interconnected version of all the experts, so each "expert backbone" can be regarded as a "generalist branch". Each generalist branch can further be categorized as image-level, patch-level, or pixel-level, according to the task its corresponding expert was trained on. Whether by soft or hard sharing, the result is a leap from expert models to a generalist model.
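
One way to picture hard and soft sharing together is sketched below: a single shared stem (hard sharing) feeds per-task branches, and each task softly mixes in features from the other branches through learned weights. This is only an illustrative interpretation of the idea, not the official "mixed parameter sharing" implementation.

```python
# Illustrative sketch of hard + soft parameter sharing (not the official code).
import torch
import torch.nn as nn

class MixedSharingGeneralist(nn.Module):
    def __init__(self, dim=256, tasks=("cls", "det", "seg")):
        super().__init__()
        self.tasks = list(tasks)
        self.shared_stem = nn.Linear(dim, dim)                    # hard sharing
        self.branches = nn.ModuleDict(
            {t: nn.Linear(dim, dim) for t in self.tasks}          # per-task branches
        )
        # soft sharing: learnable mixing weights over all branches, per task
        self.mix = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(len(self.tasks))) for t in self.tasks}
        )

    def forward(self, x):
        h = torch.relu(self.shared_stem(x))
        branch_feats = {t: self.branches[t](h) for t in self.tasks}
        stacked = torch.stack([branch_feats[t] for t in self.tasks])  # (T, B, D)
        outputs = {}
        for t in self.tasks:
            w = torch.softmax(self.mix[t], dim=0).view(-1, 1, 1)
            # each task's output = its own features plus a soft mix of the others'
            outputs[t] = branch_feats[t] + (w * stacked).sum(dim=0)
        return outputs

feats = MixedSharingGeneralist()(torch.randn(4, 256))
print({t: f.shape for t, f in feats.items()})
```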

After the first three training stages, we finally arrive at the last stage, downstream task adaptation (Adaptation).

This stage sits at the downstream end of the technology chain and is used to solve a wide variety of tasks; it is also the moment that most tests Shusheng's ability to generalize from what it has learned. At this stage, it must apply its previously acquired general knowledge to different specific tasks.

Many transfer learning methods have made real progress before this, but the problem is that they neither exploit the implicit information in upstream pre-training nor account for the scarcity of downstream data in few-shot scenarios.

Therefore, Shusheng proposes a Multi-stage Fine-tuning (MF) method to ease transfer when data is scarce. By encoding the upstream data into a generative model, namely a VQ-GAN, the pre-trained model can be transferred to multiple tasks and domains without touching the upstream data again, which also makes Shusheng more versatile and extensible.

Multi-stage Fine-tuning (MF) overview: in the first stage a VQ-GAN model is trained on upstream data, and in the second stage it is used to reconstruct the downstream data. The third stage then trains only the new task-specific parameters on the reconstructed images, while the fourth stage fine-tunes the entire model on the downstream data.
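
The last two stages follow a familiar freeze-then-unfreeze pattern, sketched below with a hypothetical helper (this is not the official MF code, and the optimizers, learning rates and toy data are assumptions): first train only the new task head with the backbone frozen, then unfreeze everything and fine-tune end to end.

```python
# Minimal freeze-then-unfreeze fine-tuning sketch (illustrative only).
import torch
import torch.nn as nn

def staged_finetune(backbone, head, loader, loss_fn, epochs_head=1, epochs_full=1):
    model = nn.Sequential(backbone, head)

    # Stage A: freeze the pre-trained backbone, train the task-specific head only.
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(epochs_head):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Stage B: unfreeze and fine-tune the whole model at a lower learning rate.
    for p in backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(epochs_full):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Toy usage with random tensors standing in for (reconstructed) downstream samples.
loader = [(torch.randn(8, 128), torch.randint(0, 10, (8,))) for _ in range(4)]
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 10)
staged_finetune(backbone, head, loader, nn.CrossEntropyLoss())
```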

At this point, a general vision model with continual learning ability is finally born.

As for the specific improvements, it is best to look at a more intuitive comparison of the experimental data.

One network for the four major vision tasks

There are many tasks in the vision field; the mainstream ones fall into four types: classification, object detection, semantic segmentation, and depth estimation.

On these four tasks, the strongest vision model to date is the CLIP model released by OpenAI last year. By comparison, "Shusheng" improves on it in both accuracy and data efficiency.

1. Accuracy

Evaluating and comparing the models trained by "Shusheng" on GV-B shows that MetaNet with multi-stage pre-training achieves excellent accuracy.

Across the 26 most representative downstream datasets such as ImageNet, "Shusheng" reduces the average error rate by 40.2%, 47.3%, 34.8% and 9.4% on the four major tasks of classification, object detection, semantic segmentation and depth estimation, respectively.

Accuracy comparison of INTERN and CLIP-R50x16 at different sample sizes

2. Data efficiency

Shusheng's improvement in data efficiency is particularly striking: only 1/10 of the downstream data is needed to exceed the accuracy of CLIP trained on the full downstream data.

Take the comparison of CLIP-R50x16 and Up-G MN-B15 on GV-B as an example. Evaluated on 26 downstream datasets across the four major task types (classification, object detection, semantic segmentation and depth estimation), the Up-G MN-B15 model trained with only 10% of the data achieves better accuracy than CLIP-R50x16 trained with all of the data on most datasets. This shows that MetaNet with multi-stage pre-training generalizes strongly and can reach SOTA accuracy with only a small number of training samples.

In downstream vision scenarios, training with small samples brings extremely fast training and extremely low training cost.

For example, on the flower species recognition task, "Shusheng" needs only two training samples per flower category to achieve 99.7% accuracy.

This flower dataset consists of 102 flower categories commonly found in the UK, with 40 to 258 images per category, and large variations in scale, pose and lighting.

Flower dataset for 102 categories: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html

A general vision platform, officially open sourced

Such a powerful general vision pre-training model has now been officially open sourced!

More importantly, the labeled dataset, network structures and evaluation benchmark mentioned above are all packaged together and open sourced in OpenGVLab.

Besides MetaNet, the released network structures also include the widely used ResNet, MobileNet, ViT, EfficientNet and others, to serve different application scenarios and empower computer vision.

But the plans for "Shusheng" do not stop there.

OpenGVLab will join OpenMMLab and OpenDILab, previously released by Shanghai Artificial Intelligence Laboratory, to form the open-source system OpenXLab, continuing to drive technological breakthroughs and ecosystem building for general artificial intelligence.

A researcher who already uses the open-source platform for autonomous driving algorithms said: "The Shusheng model series covers everything from small models that can be deployed on mobile devices to ultra-large-scale proprietary structures, which brings hope to the industry. Its convergence speed in particular greatly cuts training costs and is a major boost for putting the technology into production."

And not only autonomous driving: smart cities, smart healthcare, smart transportation and thousands of other intelligent fields will all reap the technical dividends brought by the general vision model.

A Tencent researcher praised OpenGVLab: "Open sourcing such a large piece of work really is a service to the industry. Put simply, it is indeed more fine-grained than CLIP."

Teachers and students in academia were equally impressed: "OpenGVLab integrates a large number of state-of-the-art models of various sizes, which makes them much more convenient to use and saves the tedious work of digging through different codebases for different models."

In other words, once the code and the formulas shed their tedious outer layer, people can rediscover real creativity. That is the charm of technological innovation and open-source platforms.

On a more practical note, anyone who masters this general vision model may well find the prize money piling up; on the road of technology as a productive force, another way to get rich has been born!

The "Shusheng" technical report, "INTERN: A New Learning Paradigm Towards General Vision", has been released on arXiv.

You are welcome to go treasure hunting yourself!

Leifeng Network
