
ByteDance releases ViTamin, a visual foundation model that achieves SOTA on multiple tasks and has been accepted to CVPR 2024

Author: QbitAI (Quantum Position)

Yunzhong, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

Vision-language models keep making breakthroughs, yet ViT remains the default network structure for their image encoders.

ByteDance proposes a new foundation model, ViTamin, designed specifically for the vision-language era.


Using the same dataset and training protocol, ViTamin improves zero-shot ImageNet accuracy by 2.0% over ViT.

It also performs well on 60 diverse benchmarks spanning classification, retrieval, open-vocabulary detection and segmentation, and multimodal large language models.

When the parameters are scaled up further, ViTamin-XL achieves 82.9% zero-shot ImageNet accuracy with only 436M parameters, surpassing EVA-E, which has roughly ten times as many parameters (4.4B).

The work has been accepted to CVPR 2024, the top computer vision conference.

A new benchmark for the vision-language era

In the vision-language era, how should one design a better, more scalable vision model?

In the ImageNet era, new vision models were validated on the ImageNet dataset, which spurred a steady stream of new architectures. In the vision-language era, however, new vision models are rarely proposed.

Moreover, how do existing mainstream vision models perform at data scales far larger than ImageNet? The research team tested several common models: ViT (pure Transformer), ConvNeXt (pure convolutional network), and CoAtNet (a hybrid of convolution and Transformer).

Systematic training and comparison on a publicly available dataset led to several key findings:

  • First, model scalability: ViT adapts best to different model sizes thanks to its scalable self-attention mechanism.
  • Second, data scalability: as the training data grows, the performance of all models improves.
  • Third, feature resolution: during training, the model must capture richer information than simple category labels, so the resolution of the extracted features strongly affects its predictive ability.
  • Fourth, hybrid architecture: CoAtNet generally outperforms the other models, but scaling it to billions of training samples is challenging.

Based on these findings, the researchers designed the ViTamin model.

ViTamin adopts a three-stage hybrid architecture: the first two stages use lightweight MBConv blocks, and the third stage consists of scalable Transformer blocks.


Specifically, an image first passes through a convolutional stem to produce a 2× downsampled feature map.

The feature map then goes through the first stage (two MBConv-LN blocks) and the second stage (four MBConv-LN blocks), with further downsampling to yield 16× downsampled two-dimensional features.

Next, these features are flattened into a 1D token sequence and fed into the third stage, which consists of N_B TFB-GeGLU blocks. Finally, the model is trained with a contrastive loss between image features and language features.
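To make the stage layout concrete, here is a minimal PyTorch sketch of a three-stage hybrid encoder of this kind, assuming standard inverted-bottleneck (MBConv-style) blocks and pre-norm Transformer blocks. The class names (ViTaminLikeEncoder, MBConvLN, TransformerBlock), channel widths, block internals, and downsampling convolutions are illustrative placeholders rather than the paper's exact ViTamin implementation, and the contrastive loss against text features is omitted.

```python
# A minimal sketch of the three-stage hybrid layout described above (not the official code).
import torch
import torch.nn as nn


class MBConvLN(nn.Module):
    """Simplified inverted-bottleneck (MBConv-style) block with a LayerNorm surrogate."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.norm = nn.GroupNorm(1, dim)  # channel-wise normalization for 2D feature maps
        self.pw1 = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.pw2 = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw2(self.act(self.dw(self.act(self.pw1(self.norm(x))))))


class TransformerBlock(nn.Module):
    """Standard pre-norm Transformer block standing in for TFB-GeGLU."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class ViTaminLikeEncoder(nn.Module):
    def __init__(self, width=384, depth3=14):
        super().__init__()
        self.stem = nn.Conv2d(3, width // 4, 3, stride=2, padding=1)       # 2x downsample
        self.stage1 = nn.Sequential(
            nn.Conv2d(width // 4, width // 2, 3, stride=2, padding=1),     # 4x downsample
            *[MBConvLN(width // 2) for _ in range(2)])                     # two MBConv-LN blocks
        self.stage2 = nn.Sequential(
            nn.Conv2d(width // 2, width, 3, stride=2, padding=1),          # 8x downsample
            *[MBConvLN(width) for _ in range(4)],                          # four MBConv-LN blocks
            nn.Conv2d(width, width, 3, stride=2, padding=1))               # 16x downsample
        self.stage3 = nn.Sequential(*[TransformerBlock(width) for _ in range(depth3)])

    def forward(self, images):
        x = self.stem(images)
        x = self.stage2(self.stage1(x))            # B x C x H/16 x W/16
        x = x.flatten(2).transpose(1, 2)           # flatten 2D features into a token sequence
        return self.stage3(x).mean(dim=1)          # pooled image embedding


# 224x224 images give 14x14 = 196 tokens at 16x downsampling
feats = ViTaminLikeEncoder()(torch.randn(2, 3, 224, 224))
print(feats.shape)  # torch.Size([2, 384])
```

In a CLIP-style setup, the pooled image embedding would be projected and matched against text embeddings under the contrastive loss mentioned above.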

The authors pursue a simple and effective scaling rule that considers only the model width C and the depth N_B of the third stage. When scaling to a larger model, the required width and depth can be inferred directly from the target parameter count.
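As a rough illustration of such a two-variable scaling rule, the sketch below picks a width C and a third-stage depth N_B from a target parameter budget. The 12·C²·N_B parameter estimate, the candidate width grid, and the width-to-depth ratio are all assumptions made for illustration; the paper's actual scaling formula may differ.

```python
# Hypothetical sketch: choose (C, N_B) to roughly hit a target parameter budget.
def third_stage_params(width_c: int, depth_nb: int) -> int:
    # Rough estimate: attention + MLP weights of one Transformer block ~ 12 * C^2.
    return 12 * width_c * width_c * depth_nb


def pick_config(target_params: int, aspect: float = 32.0) -> tuple[int, int]:
    """Scan widths (multiples of 64) and set depth so that C / N_B is about `aspect`."""
    best = None
    for c in range(256, 2049, 64):
        nb = max(1, round(c / aspect))
        err = abs(third_stage_params(c, nb) - target_params)
        if best is None or err < best[0]:
            best = (err, c, nb)
    return best[1], best[2]


print(pick_config(300_000_000))  # e.g. width/depth for a ~300M-parameter third stage
```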

Multiple SOTAs

The results show that the zero-shot ImageNet accuracy of ViTamin-L is 2.0% higher than that of ViT-L/14.


When the feature resolution is increased to 576 patches, ViTamin-L's accuracy further improves to 81.8%, 1.5% higher than the previous ViT-L/14 CLIPA-v2 result. Averaged over 38 datasets, ViTamin-L is 0.4% higher than ViT-H/14 while having only about half as many parameters.

When the model is scaled further, ViTamin-XL with 436M parameters reaches 82.9% zero-shot ImageNet accuracy, exceeding the 82.0% of EVA-E, which has 4.4B parameters.

The authors further validate that the ViTamin model is a powerful visual encoder for downstream tasks.

The authors evaluate a range of downstream tasks, including open-vocabulary detection and segmentation and large multimodal models (LMMs).

Compared with ViT-L, ViTamin improves the open-vocabulary detection benchmark OV-LVIS by 3.1%, and it improves by an average of 2.6% across 8 open-vocabulary segmentation tasks.

ViTamin can also be plugged directly into large multimodal models such as LLaVA and performs well across 12 multimodal question-answering benchmarks. Notably, ViTamin sets a new SOTA on 7 open-vocabulary segmentation benchmarks.

In this work, the authors establish an evaluation protocol for mainstream vision models in the vision-language setting and re-benchmark them along four aspects: data scalability, model scalability, feature resolution, and hybrid architecture.


The key findings in these four areas guide the design of ViTamin, which not only surpasses ViT in zero-shot ImageNet accuracy and average accuracy over 38 datasets, but also reaches state of the art on 22 downstream tasks, including open-vocabulary detection and segmentation and large multimodal models.

From the Intelligent Creation team

The Intelligent Creation team is ByteDance's AI and multimedia technology team, covering computer vision, audio and video editing, special effects processing, and other technical areas.

Drawing on the company's rich business scenarios, infrastructure resources, and culture of technical collaboration, the team has built a closed loop from cutting-edge algorithms to engineering systems and products, aiming to provide ByteDance's internal businesses with industry-leading capabilities in content understanding, content creation, interactive experience, and consumption, as well as industry solutions in various forms.

The team currently offers its technical capabilities and services to enterprises through Volcano Engine, ByteDance's cloud service platform, and more positions related to large-model algorithms are open.

Paper Links:

https://arxiv.org/pdf/2404.02132.pdf

Project Homepage:

https://beckschen.github.io/vitamin

— END —


Follow us and be the first to know about cutting-edge science and technology
