
Parameter count cut by 85%, performance beats ViT across the board: ViR, a new image classification method

Report by Machine Heart

Editor: Egg Sauce

Is ViT not perfect after all? Researchers from East China Normal University and other institutions have proposed a new image classification method, ViR, which beats ViT in both model size and computational complexity.

Over the past year, the Vision Transformer (ViT) has shown excellent performance on image tasks such as image classification, instance segmentation, object detection, and tracking, demonstrating its potential to replace convolutional neural networks.

However, there is evidence that when many Transformer layers are pre-trained on large-scale datasets, ViTs tend to suffer from two problems:

First, heavy computation and a large memory burden;

Second, overfitting when trained from scratch on small-scale datasets.

Specifically, pre-training on large-scale datasets and fine-tuning on downstream tasks are essential for ViTs, which leads to heavy and often redundant computation and introduces extra parameters, increasing the memory burden. In addition, ViTs with many Transformer encoder layers tend to overfit, especially when training data is limited.

To solve these problems, researchers from East China Normal University and other institutions have proposed a new image classification method, the Vision Reservoir (ViR). By splitting each image into a sequence of fixed-length tokens, ViR builds a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT. To further improve network performance, the researchers also proposed two deep ViR variants.


Paper link: https://arxiv.org/pdf/2112.13545.pdf

The researchers conducted comparative experiments between ViR and ViT on several image classification benchmarks. Without any pre-training, ViR outperforms ViT in both model size and computational complexity. Specifically, ViR's parameter count is about 15%, and in some cases as low as 5%, of ViT's, and its memory footprint is about 20%-40% of ViT's. The superiority of ViR is explained through small-world properties, the Lyapunov exponent, and memory capacity.

In general, ViR performs well with far fewer layers than ViT has encoders, as shown in Figure 1 below.


Figure 1: Time consumption comparison of ViR and ViT on the CIFAR100 dataset. ViR improves on both the initial and final accuracy of ViT without pre-training. The deep ViR here uses the parallel structure. At the same depth, ViR's time cost is much lower than ViT's.

Method introduction

ViT essentially treats image patches as a time series; its core innovation is to use kernel connection operations (such as the dot product) to capture intrinsic associations between image patches, such as spatial and temporal (sequential) consistency across different parts of an image. This prompted the researchers to turn to a brain-inspired network, Reservoir Computing (RC), which combines intrinsic spatiotemporal dynamics with lower compute and memory consumption, fewer trainable parameters, and fewer required training samples.
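For readers new to reservoir computing, the sketch below shows a generic echo state network: a fixed random recurrent network is driven by the input sequence, and only a readout is trained. The sizes, leak rate, and scaling factors here are textbook defaults, not the ViR configuration described in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 512, 768                          # reservoir neurons, input dim (assumed)
W = rng.standard_normal((N, N))          # fixed random recurrent weights
V = rng.standard_normal((N, D)) * 0.1    # fixed random input weights

# Rescale W so its spectral radius is below 1 (the echo state property).
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs, leak=0.3):
    """Drive the reservoir with a token sequence; one state per time step."""
    x = np.zeros(N)
    states = []
    for u in inputs:                     # each token is one time step
        x = (1 - leak) * x + leak * np.tanh(W @ x + V @ u)
        states.append(x.copy())
    return np.stack(states)

states = run_reservoir(rng.standard_normal((196, D)))
print(states.shape)  # (196, 512): only a readout over these states is trained
```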

In the design of ViR, the researchers first introduce the topology used inside the reservoir and present formulas and properties that clarify how it works. They then describe the proposed ViR network and give further examples of deep ViR. Finally, they analyze the intrinsic properties of ViR from several angles.

ViR follows a basic pipeline similar to ViT's; the overall network architecture is shown in Figure 2:


Figure 2: Model overview. The input image is first split into patches of appropriate size, and each patch is compressed into a vector; the resulting sequence of vectors is fed to the ViR as input over time. For better performance, the ViR core contains a residual block and can be stacked into a deep structure.
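As a rough illustration of this tokenization step, here is a minimal sketch; the patch size and the raw flattening (rather than a learned embedding) are our own assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def image_to_tokens(image, patch_size=16):
    """Split an H x W x C image into a fixed-length sequence of flattened
    patches. Illustrative only: the paper may embed patches differently."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    tokens = [
        image[i:i + patch_size, j:j + patch_size].reshape(-1)
        for i in range(0, h, patch_size)
        for j in range(0, w, patch_size)
    ]
    return np.stack(tokens)  # (num_patches, patch_size * patch_size * c)

tokens = image_to_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)  # (196, 768): fed to the ViR as 196 time steps
```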

Figure 2 depicts the proposed image classification model. Its key component is the ViR core, which consists of a reservoir with the internal topology described above, plus residual blocks.
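One plausible reading of this core, as a minimal sketch: the reservoir runs over the token sequence, and a skip connection adds each input token back onto the projected reservoir state. The projection matrix P and the exact placement of the residual are assumptions here, not the paper's verified design:

```python
import numpy as np

def vir_core(tokens, W, V, P, leak=0.3):
    """One ViR-style core: drive the reservoir with the token sequence,
    project each state back to the token dimension, add a residual path."""
    x = np.zeros(W.shape[0])
    out = []
    for u in tokens:
        x = (1 - leak) * x + leak * np.tanh(W @ x + V @ u)
        out.append(P @ x + u)  # residual: reservoir output plus input token
    return np.stack(out)

# Toy usage with assumed sizes; the output keeps the input shape,
# which is what allows cores to be stacked into a deep structure.
rng = np.random.default_rng(1)
N, D = 512, 768
W = rng.standard_normal((N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
V = rng.standard_normal((N, D)) * 0.1
P = rng.standard_normal((D, N)) * 0.1
y = vir_core(rng.standard_normal((196, D)), W, V, P)
print(y.shape)  # (196, 768)
```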

By further stacking reservoirs, the researchers obtain deep ViRs that further enhance network performance. As shown in Figure 4 below, the first variant is a serial structure consisting of L reservoirs.


Figure 4: Structure of the deep ViR. The upper part is the serial reservoir structure, and the lower part is the parallel one.
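In code, the two stacking schemes might look like the following sketch, where core_fn stands in for any single sequence-to-sequence core (such as the vir_core sketch above). The parallel merge rule (averaging the branches) is our assumption; the paper's Figure 4 gives the authoritative wiring:

```python
def deep_vir_serial(tokens, layers, core_fn):
    """Serial deep ViR: each core consumes the previous core's output."""
    y = tokens
    for params in layers:  # one parameter tuple per reservoir layer
        y = core_fn(y, *params)
    return y

def deep_vir_parallel(tokens, layers, core_fn):
    """Parallel deep ViR: every core sees the same input sequence; branch
    outputs are averaged here, though the paper's merge rule may differ."""
    return sum(core_fn(tokens, *params) for params in layers) / len(layers)

# Example: with core_fn=vir_core from the sketch above, `layers` would be
# a list of L (W, V, P) tuples, one per reservoir.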

Experiments

On three classic datasets, MNIST, CIFAR10, and CIFAR100, the researchers compared the proposed ViR model with the commonly used ViT model, compared the models' parameter counts, and analyzed convergence speed and memory consumption. Robustness was also tested on CIFAR10-C. In the experiments, the original ViT is referred to as ViT-base, with some modifications, as shown in Table 1 below.


Table 1: System parameters of ViR and ViT. N is the number of neurons in a reservoir, α is the scaling parameter for the spectral radius of W, SD is the sparsity of the input matrix V, and r_i, r_j, r_k and the jump size are detailed in Section 3.1 of the paper. In the ViT row, the patch size is the same for all tested datasets.
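To see how Table 1's N, α, and SD would typically enter a reservoir's construction, here is a standard echo-state-style sketch; the paper's specific topology built from r_i, r_j, r_k and the jump size is not reproduced here:

```python
import numpy as np

def build_reservoir(N, D, alpha=0.9, sd=0.1, seed=0):
    """Construct reservoir weights from Table-1-style parameters.

    N     -- number of reservoir neurons
    alpha -- scaling parameter for the spectral radius of W
    sd    -- sparsity (fraction of nonzero entries) of the input matrix V
    A generic construction, not the paper's ring-and-jump topology."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, N))
    W *= alpha / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius = alpha
    V = rng.standard_normal((N, D)) * (rng.random((N, D)) < sd)  # sparse inputs
    return W, V

W, V = build_reservoir(N=512, D=768, alpha=0.9, sd=0.1)
```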

Without any pre-training, the researchers compared ViR-1, ViR-3, ViR-6, and ViR-12 with ViT-12 on image classification tasks on MNIST, CIFAR10, and CIFAR100. Table 3 below compares classification accuracy and parameter counts.


Table 3: Comparison of ViR and ViT models across image classification datasets. The numeric suffix indicates the number of ViR layers or ViT encoders. "M" denotes millions of parameters.


Figure 6: Comparison of memory footprint on the MNIST and CIFAR100 datasets at patch sizes of 4 × 4, 14 × 14, and 16 × 16.

For model robustness, the researchers assessed corruption of the input images and perturbations to the system hyperparameters.


Table 4: The effect of input image corruption on robustness.

For more details, please refer to the original paper.
