
Has Mamba Cooled Off Right After Catching Fire? Does Vision Really Need Mamba?

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat: dddvision, with a note of research direction + school/company + nickname, and you will be added to the group. Industry subgroups are listed at the end of the article.

0. What is this article about?

In recent years, Transformer has become the mainstream backbone for a wide range of tasks, supporting many important models such as BERT, the GPT series, and ViT. However, Transformer's token mixer, attention, has quadratic complexity in sequence length, which poses a significant challenge for processing long sequences. To address this, a series of token mixers with linear complexity in the number of tokens have been introduced, such as dynamic convolution, Linformer, Longformer, Big Bird, and Performer. More recently, a new wave of RNN-like models has emerged and attracted considerable community interest, because they can be trained in parallel and run efficiently on long sequences. Notably, models such as RWKV and Mamba have proven to be effective backbones for large language models (LLMs).
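As a quick reminder of the scaling argument behind this (standard complexity bookkeeping, not taken from this article): attention builds an N×N score matrix over the N tokens, while an RNN/SSM-style recurrence updates a fixed-size state once per token.

```latex
% Mixing N tokens of dimension d (standard formulations, not from the paper).
% Self-attention forms an N x N score matrix:
\[
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V
\quad\Longrightarrow\quad \mathcal{O}(N^{2}d)\ \text{compute},\ \mathcal{O}(N^{2})\ \text{memory}.
\]
% An RNN/SSM-style recurrence keeps a fixed-size state h_t, so the per-step cost
% does not depend on N and total compute is linear in the sequence length:
\[
h_{t}=A\,h_{t-1}+B\,x_{t},\qquad y_{t}=C\,h_{t}
\quad\Longrightarrow\quad \text{compute linear in } N,\ \text{constant-size memory}.
\]
```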

Inspired by the promising capabilities of RNN-like models, various studies have attempted to introduce Mamba into visual recognition tasks, such as the pioneering works Vision Mamba, VMamba, LocalMamba, and PlainMamba. Mamba's token mixer is a structured state space model (SSM), which is in the spirit of RNNs. However, experiments show that SSM-based vision models actually underperform state-of-the-art convolution-based and attention-based models. This raises a compelling research question: do we really need Mamba for visual recognition?

In this paper, we investigate the nature of Mamba and conceptually conclude that Mamba is ideally suited for tasks with two key characteristics: long sequences and autoregression, owing to the inherently RNN-like mechanism of its SSM. Unfortunately, not many visual tasks have both. For example, image classification on ImageNet exhibits neither the long-sequence nor the autoregressive characteristic, whereas object detection and instance segmentation on COCO and semantic segmentation on ADE20K exhibit only the long-sequence characteristic (see the token-count sketch below). The autoregressive characteristic requires each token to aggregate information only from the preceding and current tokens, a setting known as the causal mode of token mixing. In fact, all visual recognition tasks belong to the understanding domain rather than the generative domain, meaning the model can see the entire image at once. Imposing an additional causal constraint on token mixing in a visual recognition model can therefore degrade performance. Although this problem can be mitigated with bidirectional branches, it inevitably persists within each branch.
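To make the long-sequence criterion concrete, here is a rough back-of-the-envelope token count. The 16×16 patch size and the input resolutions used below (224×224 for ImageNet classification, roughly 800×1280 for COCO detection, 512×2048 for ADE20K segmentation) are common defaults assumed for illustration, not figures quoted from this article.

```python
# Rough token counts under the assumed patch size and input resolutions.
def num_tokens(height, width, patch=16):
    return (height // patch) * (width // patch)

print(num_tokens(224, 224))    # ImageNet classification:  196 tokens -> short sequence
print(num_tokens(800, 1280))   # COCO detection:          4000 tokens -> long sequence
print(num_tokens(512, 2048))   # ADE20K segmentation:     4096 tokens -> long sequence
```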

Based on the above conceptual discussion, we propose the following two hypotheses:

• Hypothesis 1: SSM is not necessary for the image classification task, because this task exhibits neither the long-sequence nor the autoregressive characteristic.

• Hypothesis 2: SSM may still be beneficial for object detection, instance segmentation, and semantic segmentation, because these tasks exhibit the long-sequence characteristic even though they are not autoregressive.

To experimentally test our hypotheses, we developed a series of models called MambaOut by stacking gated CNN blocks. The only difference between the gated CNN block and the Mamba block is the presence of the SSM. Experimental results show that the simpler MambaOut models actually outperform visual Mamba models on ImageNet, validating our Hypothesis 1. We also present empirical results showing that MambaOut falls short of state-of-the-art visual Mamba models on detection and segmentation tasks, which highlights the potential of SSM for these tasks and effectively validates our Hypothesis 2.

1. Paper Information

Title: MambaOut: Do We Really Need Mamba for Vision?

Authors: Weihao Yu, Xinchao Wang

Institution: National University of Singapore

Original link: https://arxiv.org/abs/2405.07992

Code link: https://github.com/yuweihao/MambaOut

2. Abstract

Mamba, an architecture whose token mixer is an RNN-like state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and has subsequently been applied to visual tasks. However, compared with convolutional and attention-based models, Mamba's performance on vision tasks is generally underwhelming. In this paper, we delve into the nature of Mamba and conceptually conclude that Mamba is well suited for tasks with long-sequence and autoregressive characteristics. Among visual tasks, image classification conforms to neither characteristic, so we hypothesize that Mamba is not necessary for this task. Detection and segmentation tasks are not autoregressive either, but they do follow the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential on them. To empirically verify our hypotheses, we build a series of models, named MambaOut, by stacking Mamba blocks while removing their core token mixer, the SSM. The results strongly support our hypotheses. Specifically, our MambaOut models outperform all visual Mamba models on the ImageNet image classification task, suggesting that Mamba is indeed unnecessary for this task. As for detection and segmentation tasks, MambaOut cannot match the performance of state-of-the-art visual Mamba models, showing the potential of Mamba for long-sequence vision tasks. The code is available at https://github.com/yuweihao/MambaOut.

3. Results Showcase

(a) Architecture of the Gated CNN and Mamba blocks. The Mamba block extends the Gated CNN with an additional state space model (SSM), which we argue is not necessary for image classification on ImageNet. To empirically test this claim, we stack Gated CNN blocks to build a series of models called MambaOut. (b) MambaOut outperforms visual Mamba models such as Vision Mamba, VMamba, and PlainMamba on ImageNet image classification.


4. Major Contributions

The contributions of our paper are threefold:

First, we analyze the RNN-like mechanism of SSM and conceptually conclude that Mamba is suitable for tasks with long sequences and autoregressive characteristics.

Second, we examine the characteristics of visual tasks and hypothesize that SSM is unnecessary for image classification on ImageNet, because this task exhibits neither the long-sequence nor the autoregressive characteristic; it is nevertheless still worthwhile to explore the potential of SSM for detection and segmentation tasks, which exhibit the long-sequence characteristic even though they are not autoregressive.

Third, we develop a series of models based on gated CNN blocks but without SSM, called MambaOut. Experiments show that MambaOut indeed surpasses visual Mamba models on the ImageNet image classification task, yet fails to reach the performance of state-of-the-art visual Mamba models on detection and segmentation tasks. These observations further validate our hypotheses. Thus, by Occam's razor, MambaOut can serve as a natural baseline for future research on visual Mamba models.

5. What is the rationale?

From a memory perspective, the figure illustrates the mechanisms of causal attention and RNN-like models, where x_i denotes the input token at step i. (a) Causal attention stores the keys k and values v of all previous tokens as its memory. The memory is lossless, since it is updated by continually appending the current token's key and value, but the drawback is that the computational cost of integrating the old memory with the current token grows as the sequence gets longer. Attention can therefore handle short sequences effectively but struggles with longer ones. (b) In contrast, an RNN-like model compresses the previous tokens into a fixed-size hidden state h, which serves as its memory. Because the size is fixed, RNN memory is inherently lossy and cannot directly match the lossless memory capacity of attention. Nonetheless, RNN-like models can show significant advantages on long sequences, since the cost of merging the old memory with the current input remains constant regardless of sequence length.
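The following minimal Python sketch (our own illustration, not the paper's code) mirrors the two memory mechanisms described above: causal attention accumulates an ever-growing key/value cache, while an RNN-like model compresses the past into a fixed-size hidden state.

```python
import torch

d, T = 16, 8
tokens = torch.randn(T, d)                         # a stream of tokens x_1 ... x_T

# (a) Causal attention: memory = all previous keys/values (lossless, grows with t).
keys, values = [], []
for x_t in tokens:
    keys.append(x_t)
    values.append(x_t)
    K, V = torch.stack(keys), torch.stack(values)  # memory size: t x d
    w = torch.softmax(K @ x_t / d ** 0.5, dim=0)   # per-step cost grows with t
    y_t = w @ V

# (b) RNN-like model: memory = a fixed-size hidden state h (lossy, constant size).
h = torch.zeros(d)
for x_t in tokens:
    h = torch.tanh(0.5 * h + 0.5 * x_t)            # toy update standing in for an SSM
    y_t = h                                        # per-step cost independent of t
```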


(a) Two modes of token mixing. For a total of T tokens, the fully-visible mode allows token t to aggregate the inputs of all tokens, i.e., {x_i, i = 1, ..., T}, to compute its output y_t. In contrast, the causal mode restricts token t to aggregating only the inputs of the preceding and current tokens, {x_i, i = 1, ..., t}. By default, attention operates in fully-visible mode but can be switched to causal mode with a causal attention mask. RNN-like models, such as Mamba's SSM, inherently operate in causal mode due to their recurrent nature. (b) We modified ViT's attention from fully-visible mode to causal mode and observed a performance drop on ImageNet, suggesting that causal mixing is unnecessary for understanding tasks.
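To show the two mixing modes side by side, here is a minimal sketch (again an illustration, not code from the paper or from ViT) that computes the same attention scores and either leaves them fully visible or restricts them with a causal mask.

```python
import torch
import torch.nn.functional as F

T, d = 6, 16                                    # number of tokens, feature dimension
x = torch.randn(1, T, d)
q = k = v = x                                   # single head, no projections, for brevity

scores = q @ k.transpose(-2, -1) / d ** 0.5     # (1, T, T) pairwise scores

# Fully-visible mode: every token attends to all T tokens.
full = F.softmax(scores, dim=-1) @ v

# Causal mode: token t may only attend to tokens 1..t (future positions masked out).
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
causal = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v
```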


(a) The overall framework of MambaOut for visual recognition. Similar to ResNet, MambaOut adopts a hierarchical architecture with four stages, where D_i denotes the channel dimension of the i-th stage. (b) The architecture of the gated CNN block. The gated CNN block differs from the Mamba block only in the absence of the SSM (state space model).
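For intuition about what such a block looks like, below is a minimal PyTorch sketch of a gated CNN block in the spirit of this description. The kernel size, expansion ratio, and placement of the activation are assumptions for illustration; the authors' exact implementation is in the official MambaOut repository.

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    """Sketch of a Mamba-style block with the SSM removed (assumed hyperparameters)."""

    def __init__(self, dim, expansion=8 / 3, kernel_size=7):
        super().__init__()
        hidden = int(dim * expansion)
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden * 2)          # value branch + gate branch
        self.conv = nn.Conv2d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2, groups=hidden)  # depthwise token mixer
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)              # output projection

    def forward(self, x):                              # x: (B, H, W, C)
        shortcut = x
        x = self.norm(x)
        value, gate = self.fc1(x).chunk(2, dim=-1)
        value = self.conv(value.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # spatial mixing
        x = self.fc2(self.act(gate) * value)           # gating, then projection
        return x + shortcut                            # residual connection

# Example: one block applied to a 14x14 feature map with 192 channels.
block = GatedCNNBlock(192)
out = block(torch.randn(2, 14, 14, 192))
```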


6. Experimental Results

Table 1 compares MambaOut models, visual Mamba models, and various other convolution- and attention-based models on ImageNet. Notably, our MambaOut models, which contain no SSM, consistently outperform the visual Mamba models that include SSM across all model sizes. For example, MambaOut-Small reaches a top-1 accuracy of 84.1%, 0.4% higher than LocalVMamba-S, while requiring only 79% of its MACs. These results strongly support our Hypothesis 1: introducing SSM for image classification on ImageNet is unnecessary, in keeping with Occam's razor.

In addition, the visual Mamba models currently show a significant performance gap to state-of-the-art convolutional and attention models. For example, CAFormer-M36, which uses traditional token mixers such as simple separable convolutions and standard attention, outperforms all visual Mamba models of comparable size by more than 1% in accuracy. If future research aims to challenge our Hypothesis 1, a visual Mamba model with convolution and SSM token mixers will need to achieve state-of-the-art performance on ImageNet.


Although MambaOut can outperform some visual Mamba models on COCO object detection and instance segmentation, it still lags behind the most advanced ones such as VMamba and LocalVMamba. For example, with MambaOut-Tiny as the backbone of Mask R-CNN, performance is 1.4 APb and 1.1 APm lower than with VMamba-T. This performance gap highlights the benefit of integrating Mamba in long-sequence vision tasks and further reinforces our Hypothesis 2. However, visual Mamba still shows a significant performance gap to TransNeXt, a state-of-the-art convolution-attention hybrid model, and it needs to further validate its effectiveness by outperforming other state-of-the-art models on visual detection tasks.


The performance trend for semantic segmentation on ADE20K is similar to that for object detection on COCO: MambaOut can outperform some visual Mamba models, but cannot match the state-of-the-art ones. For example, LocalVMamba-T outperforms MambaOut-Tiny by 0.5 mIoU under single-scale (SS) evaluation and also leads under multi-scale (MS) evaluation, further empirically confirming our Hypothesis 2. In addition, compared with more advanced hybrid models that integrate convolution and attention, such as SG-Former and TransNeXt, the visual Mamba models still fall noticeably short. Visual Mamba needs to further demonstrate its long-sequence modeling strength by delivering stronger performance on visual segmentation tasks.


7. Summary & Future Work

This paper discusses the Mamba mechanism conceptually and concludes that it is well suited for tasks with long-sequence and autoregressive characteristics. We analyze common visual tasks against these criteria and argue that introducing Mamba for ImageNet image classification is unnecessary, since that task exhibits neither characteristic, whereas the potential of Mamba for visual detection and segmentation tasks, which do exhibit the long-sequence characteristic, deserves further exploration. To empirically substantiate our claims, we develop the MambaOut models, which use Mamba blocks without their core token mixer, the SSM. MambaOut outperforms all visual Mamba models on ImageNet, yet shows a significant performance gap to state-of-the-art visual Mamba models on detection and segmentation, validating both claims. Due to limited computing resources, this paper only verifies the Mamba concept on vision tasks. In the future, we may further explore the concepts of Mamba and RNNs, as well as the integration of RNNs and Transformer, for large language models (LLMs) and large multimodal models (LMMs).

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only; in case of any infringement, please contact us and the article will be deleted.

3D Vision Workshop Exchange Group

We have established multiple communities around 3D vision, covering 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization/Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase measuring deflectometry, Halcon, photogrammetry, camera arrays, photometric stereo, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note of research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase measuring deflectometry, robotic arm grasping, hands-on point cloud work, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line and area structured light, hardware structured-light scanners, UAVs, etc.