
Why did Mamba subvert Transformer's dominance in computer vision?

Author: 3D Vision Workshop

Source: 3D Vision Workshop


Deep neural networks (DNNs) have demonstrated significant performance across a variety of artificial intelligence (AI) tasks, and the underlying architecture plays a crucial role in determining a model's capabilities. Traditional neural networks typically consist of multilayer perceptron (MLP) or fully connected (FC) layers. Convolutional neural networks (CNNs) introduce convolutional and pooling layers, which are particularly suitable for translation-invariant data such as images. Recurrent neural networks (RNNs) use recurrent units to process sequential or time-series data. To address the limitation that CNN, RNN, and GNN models capture only local relationships, the Transformer was proposed in 2017 and excels at learning long-range feature representations. The Transformer relies primarily on attention mechanisms, such as self-attention and cross-attention, to extract intrinsic features and improve their representation. Pre-trained large-scale Transformer-based models such as GPT-3 perform strongly on a variety of NLP datasets and excel at natural language understanding and generation tasks. The excellent performance of Transformer-based models has driven their widespread adoption in vision applications. At the heart of the Transformer is its extraordinary ability to capture long-range dependencies and to make maximal use of large datasets. The feature extraction module is the main component of the vision Transformer architecture; it processes the data with a series of self-attention blocks, significantly improving its ability to analyze images.

However, a major obstacle for the Transformer is the huge computational demand of the self-attention mechanism, which grows quadratically with image resolution. The softmax operation inside the attention block further exacerbates these requirements, creating significant challenges for deploying such models on edge and low-resource devices. In addition, real-time computer vision systems built on Transformers must adhere to strict low-latency standards to maintain a high-quality user experience. This situation has driven the continued evolution of new architectures that improve performance, although this often comes at the cost of higher computational needs. Many new models based on sparse attention mechanisms or novel neural network paradigms have been proposed to reduce computational cost while still capturing long-range dependencies and maintaining high performance. State-space models (SSMs) have become a central focus of these developments. As shown in Figure 1(a), there has been an explosive increase in the number of publications related to SSMs. Originally designed with state variables to simulate dynamic systems in fields such as control theory and computational neuroscience, SSMs, when adapted for deep learning, primarily describe linear time-invariant systems. With the development of SSMs, a new type of selective state-space model called Mamba has emerged. It improves on prior SSMs for modeling discrete data such as text through two key changes. First, it has a mechanism that adjusts the SSM parameters based on the input, dynamically enhancing information filtering. Second, Mamba uses a hardware-aware algorithm that processes the sequence in time linear in its length, increasing computational speed on modern hardware. Inspired by Mamba's achievements in language modeling, several initiatives now aim to transfer this success to the visual domain. Many studies have also explored its integration with Mixture-of-Experts (MoE) techniques, such as Jamba, MoE-Mamba, and BlackMamba, which surpass the established Transformer-MoE architectures with fewer training steps. As shown in Figure 1(b), since the release of Mamba in December 2023, the number of research papers focusing on Mamba in the field of vision has grown rapidly, peaking in March 2024. This trend suggests that Mamba is emerging as a prominent research area in vision and may provide a viable alternative to Transformers. A review of current work is therefore necessary and timely to give a detailed overview of this new approach in an evolving field. Accordingly, we provide a comprehensive overview of how the Mamba model is used in the visual domain. This article aims to serve as a guide for researchers who wish to delve deeper into this field.


Key contributions to our work include:

(1) This survey is the first comprehensive review of Mamba techniques in the field of vision, with a clear focus on analyzing the strategies the surveyed works propose.

(2) Going beyond naive Mamba-based visual frameworks, we investigate how Mamba's capabilities can be enhanced and how it can be combined with other architectures to achieve higher performance.

(3) We conduct an in-depth discussion by organizing the literature according to application tasks. We establish a taxonomy that identifies the specific progress made on each task and provides insights for overcoming the remaining challenges.

The review is structured as follows: Section 2 explores the general and mathematical concepts behind the Mamba strategy. Section 3 discusses the naive Mamba vision models proposed in recent years and how they can be integrated with other techniques to enhance performance. Section 4 explores the application of Mamba techniques to a variety of computer vision tasks. Finally, Section 5 summarizes the survey.

Title: A Survey on Visual Mamba

Authors: Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Zi Ye

Institutions: Automotive Software Innovation Center, University of Chinese Academy of Sciences, University of Science and Technology of China, Institute of Intelligent Software, Saarland University

Original link: https://arxiv.org/abs/2404.15956

State-space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently shown significant potential for long-sequence modeling. Because the self-attention mechanism in the Transformer scales quadratically with image size, with correspondingly growing computational demands, researchers are now exploring how to adapt Mamba for computer vision tasks. This article is the first review aimed at providing an in-depth analysis of Mamba models in the field of computer vision. It begins with a discussion of the fundamental concepts that have contributed to Mamba's success, including the state-space model framework, the selection mechanism, and the hardware-aware design. Next, we review visual Mamba models by classifying them into foundational models and models augmented with techniques such as convolution, recurrence, and attention to improve their sophistication. We further delve into the wide range of Mamba applications in vision tasks, including its use as a backbone at various levels of vision processing. This covers general vision tasks, medical vision tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing vision tasks. In particular, we introduce general vision tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope that this effort will spark additional interest within the community to address current challenges and further apply the Mamba model to the field of computer vision.

SSMs are typically used as standalone sequence transformations that can be integrated into end-to-end neural network architectures. Here we describe a few foundational architectures. Linear attention approximates self-attention with a recurrent mechanism and can be viewed as a degenerate linear SSM. H3, as shown in Figure 2, places an SSM between two gated connections and inserts a standard local convolution in front of it. Hyena, which follows H3, replaces the SSM layer with an MLP-parameterized global convolution. RetNet introduces additional gates and uses a simpler SSM; it enables an alternative parallelizable computation path and uses a variant of multi-head attention (MHA) instead of convolution. Inspired by the attention-free Transformer, the recent RNN design RWKV can be interpreted as the ratio of two SSMs, since its main "WKV" mechanism involves linear time-invariant (LTI) recurrences.
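These designs all build on the same discretized linear recurrence at the heart of SSM layers. As a grounding reference, the following minimal NumPy sketch (with illustrative, hypothetical dimensions) computes that recurrence, h_t = A_bar h_{t-1} + B_bar x_t and y_t = C h_t, in time linear in the sequence length; Mamba's selection mechanism additionally makes B_bar, C, and the discretization step functions of the input.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Minimal linear SSM recurrence over a 1D input sequence.

    x:     (L,)   input sequence
    A_bar: (N, N) discretized state matrix
    B_bar: (N,)   discretized input projection
    C:     (N,)   output projection
    Returns y: (L,) output sequence.
    """
    h = np.zeros(A_bar.shape[0])
    y = np.empty_like(x)
    for t, x_t in enumerate(x):          # one step per token: linear in L
        h = A_bar @ h + B_bar * x_t      # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C @ h                     # y_t = C h_t
    return y

# Toy usage: a 16-dimensional state over a length-100 sequence.
rng = np.random.default_rng(0)
L, N = 100, 16
y = ssm_scan(rng.standard_normal(L), 0.9 * np.eye(N),
             rng.standard_normal(N), rng.standard_normal(N))
```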


The original Mamba block was designed for one-dimensional sequences, whereas vision-related tasks require processing multi-dimensional inputs such as images, videos, and 3D representations. To adapt Mamba to these tasks, improving the scanning mechanism and the architecture of Mamba blocks is essential for effectively processing multi-dimensional inputs.

In this section, we present efforts to enable Mamba to handle vision-related tasks and to enhance its efficiency and performance. We first dive into two foundational works, Vision Mamba and VMamba, which introduced the ViM block and the VSS block, respectively, as the basis for subsequent research. We then explore other efforts focused on improving the Mamba architecture as a backbone for vision-related tasks. Finally, we discuss work that integrates Mamba with other architectures such as convolution, recurrence, and attention.

4.1 Visual Mamba block

Inspired by the vision Transformer architecture, it seems natural to keep the framework of the Transformer model while replacing the attention block with a Mamba block, leaving the rest of the pipeline intact. The heart of the problem is adapting the Mamba block to vision-related tasks. Almost simultaneously, Vision Mamba and VMamba proposed their own solutions: the ViM block and the VSS block.

The ViM block, sometimes referred to as a bidirectional Mamba block, uses positional embeddings to annotate the image sequence and bidirectional state-space models to compress the visual representation. It processes the input in both forward and backward directions, using a 1D convolution for each direction, as shown in Figure 4(a).
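As a rough illustration of this bidirectional design (not the authors' exact implementation), the sketch below processes a token sequence in the forward and backward directions, each with its own depthwise 1D convolution, and fuses the two scans; the `ssm` method is a runnable placeholder standing in for the real selective-scan kernel.

```python
import torch
import torch.nn as nn

class BiDirectionalMixer(nn.Module):
    """Toy sketch of a ViM-style bidirectional token mixer."""
    def __init__(self, dim):
        super().__init__()
        self.conv_fwd = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.conv_bwd = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    @staticmethod
    def ssm(x):
        # Placeholder scan (cumulative average) so the sketch runs end to end;
        # a real block would apply a selective state-space scan here.
        steps = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return x.cumsum(dim=1) / steps

    def forward(self, x):                                   # x: (B, L, D) tokens
        fwd = self.conv_fwd(x.transpose(1, 2)).transpose(1, 2)            # forward branch
        bwd = self.conv_bwd(x.flip(1).transpose(1, 2)).transpose(1, 2)    # backward branch
        y = self.ssm(fwd) + self.ssm(bwd).flip(1)           # scan both ways, re-align backward
        return self.proj(y)

tokens = torch.randn(2, 196, 64)          # e.g. 14x14 patches, 64-dim embeddings
out = BiDirectionalMixer(64)(tokens)      # (2, 196, 64)
```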


The Visual State Space (VSS) block contains the key state-space model operations. It first passes the input through a depthwise convolutional layer, applies a SiLU activation, and then processes the result with the state-space model. The output of the state-space model is layer-normalized and then merged with the output of the other (gating) information flow, as shown in Figure 3(b). To address the direction-sensitivity problem they encountered, the authors introduced the Cross-Scan Module (CSM), which traverses the spatial domain and converts any non-causal visual image into ordered patch sequences, as shown in Figure 4(b).
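The cross-scan idea can be made concrete with a small index-manipulation sketch (a simplification of the actual CSM, in which each directional sequence would be fed through its own SSM before merging): it unfolds a patch grid into four directional 1D sequences and folds them back into a 2D map.

```python
import torch

def cross_scan(x):
    """Unfold a (B, H, W, C) patch grid into four 1D scan orders:
    row-major, column-major, and their two reversals (CSM-style sketch)."""
    B, H, W, C = x.shape
    row = x.reshape(B, H * W, C)                      # left-to-right, top-to-bottom
    col = x.transpose(1, 2).reshape(B, H * W, C)      # top-to-bottom, left-to-right
    return [row, row.flip(1), col, col.flip(1)]       # four directional sequences

def cross_merge(seqs, H, W):
    """Fold the four scanned sequences back into one (B, H, W, C) map by summing."""
    row, row_rev, col, col_rev = seqs
    B, _, C = row.shape
    merged = (row + row_rev.flip(1)).reshape(B, H, W, C)          # undo reversal, add
    col_sum = (col + col_rev.flip(1)).reshape(B, W, H, C).transpose(1, 2)
    return merged + col_sum

grid = torch.randn(1, 14, 14, 64)
sequences = cross_scan(grid)               # each: (1, 196, 64), one per SSM branch
restored = cross_merge(sequences, 14, 14)  # (1, 14, 14, 64)
```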

4.2 Pure Mamba

Inspired by the Vision Transformer architecture, Vision Mamba replaces the Transformer encoder with a ViM block-based Vision Mamba encoder while retaining the rest of the process. This involves converting a 2D image into flat patches, then projecting those patches into vectors, and adding positional embeddings. A class token represents the entire patch sequence, and subsequent steps involve the normalization layer and the MLP layer to derive the final prediction.
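A condensed PyTorch sketch of that front end is given below; the patch size, embedding width, and the choice to prepend the class token are illustrative assumptions rather than the paper's exact settings, and the stack of ViM encoder blocks is omitted.

```python
import torch
import torch.nn as nn

class PatchEmbedFrontEnd(nn.Module):
    """Sketch of a Vision-Mamba-style front end: patchify, project,
    add a class token, and add learned positional embeddings."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=192):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # patchify + project
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, img):                              # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                   # attach the class token
        return x + self.pos_embed                        # add positional embedding

tokens = PatchEmbedFrontEnd()(torch.randn(2, 3, 224, 224))   # (2, 197, 192)
# `tokens` would then pass through a stack of ViM blocks, a norm layer, and an MLP head.
```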

LocalMamba is built on top of ViM blocks and introduces a novel local scan within distinct windows to capture detailed local information alongside global context. In addition, LocalMamba searches for the scan directions to use at different network layers, identifying and applying the most effective combinations. The authors propose two variants, a plain structure and a hierarchical structure. Their LocalVim block consists of four scanning directions (see Figure 4(d)): the ViM scan and a scan that partitions tokens into distinct windows, together with their flipped counterparts that scan from tail to head, followed by a state-space module and a spatial and channel attention module (SCAttn).
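The window-local scan can be illustrated as a simple token reordering, sketched below under the assumption of non-overlapping square windows (a simplification of LocalMamba's full mechanism): tokens belonging to the same window become contiguous in the 1D scan sequence.

```python
import torch

def local_window_scan(x, win=2):
    """Reorder a (B, H, W, C) grid so tokens inside each win x win window
    are contiguous in the scan sequence (LocalMamba-style local scan sketch)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // win, win, W // win, win, C)   # split the grid into windows
    x = x.permute(0, 1, 3, 2, 4, 5)                     # (B, nH, nW, win, win, C)
    return x.reshape(B, H * W, C)                       # windows scanned one after another

grid = torch.randn(1, 4, 4, 8)
seq = local_window_scan(grid, win=2)   # (1, 16, 8): tokens of each 2x2 window are adjacent
```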

Based on the VSS block, the PlainMamba block enhances the ability to learn features from 2D images through two main mechanisms: (i) a continuous 2D scanning process that improves spatial continuity by ensuring that adjacent tokens in the scan sequence are spatially contiguous, as shown in Figure 4(c), and (ii) direction-aware updates, which enable the model to identify spatial relationships between tokens by encoding orientation information. PlainMamba mitigates the spatial discontinuity that arises when the 2D scanning mechanisms of ViM and VMamba jump to a new row or column, by continuing the scan in the opposite direction until the final visual token of the image is reached. In addition, PlainMamba eliminates the need for special tokens.
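A sketch of that continuous ("snake") ordering follows, with the direction-aware embeddings omitted: every other row is reversed so the 1D sequence never jumps between spatially distant patches.

```python
import torch

def continuous_2d_scan(x):
    """PlainMamba-style continuous scan sketch: reverse every other row so
    consecutive tokens in the 1D sequence stay spatially adjacent."""
    B, H, W, C = x.shape
    rows = [x[:, h] if h % 2 == 0 else x[:, h].flip(1) for h in range(H)]
    return torch.cat(rows, dim=1)      # (B, H*W, C), spatially contiguous order

grid = torch.randn(1, 4, 4, 8)
seq = continuous_2d_scan(grid)         # row 0 left->right, row 1 right->left, ...
```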

For lightweight model design, EfficientVMamba improves on VMamba through an atrous-based selective scanning method, Efficient 2D Scanning (ES2D). ES2D advances the scan both vertically and horizontally while skipping patches, keeping the total number of patches unchanged, as shown in Figure 4(e). The Efficient Visual State Space (EVSS) block includes a convolutional branch for local features and uses ES2D as the SSM branch for global features, with both branches ending in a squeeze-excitation block. EVSS blocks are employed in stages 1 and 2, while inverted residual blocks are used in stages 3 and 4 to enhance the capture of global representations.
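The skip-sampling idea behind ES2D can be sketched as an atrous partition of the patch grid into interleaved sub-grids (illustrative only; the actual module also merges the scanned groups back into a feature map):

```python
import torch

def es2d_partition(x, step=2):
    """EfficientVMamba-style atrous partition sketch: split a (B, H, W, C)
    grid into step*step interleaved sub-grids, each scanned on its own."""
    B, H, W, C = x.shape
    groups = []
    for i in range(step):
        for j in range(step):
            groups.append(x[:, i::step, j::step].reshape(B, -1, C))   # skip-sampled patches
    return groups    # step*step sequences, together covering every patch exactly once

grid = torch.randn(1, 8, 8, 32)
sub_sequences = es2d_partition(grid)   # four (1, 16, 32) sequences for the SSM branch
```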

Visual data is a form of multidimensional data, so existing Mamba models for multidimensional data are also applicable to vision-related tasks; however, they often lack mechanisms for cross-dimensional and intra-dimensional communication or data dependence. The MambaMixer block introduces a dual selection mechanism across tokens and channels. Its sequential selective mixers are connected by a weighted averaging mechanism, giving layers direct access to the inputs and outputs of earlier layers. Mamba-ND extends the application of SSMs to higher dimensions by alternating the sequence ordering between layers, and extends this approach to 3D data with a scanning strategy similar to VMamba's 2D scheme. In addition, its authors advocate multi-head SSMs as an analogue of multi-head attention. To address the inefficiency and performance challenges that traditional Transformers face in image and time-series processing, SiMBA (Simplified Mamba-based Architecture) uses Mamba blocks for sequence modeling and EinFFT for channel modeling, aiming to improve the stability and efficiency of models on image and time-series tasks. Mamba blocks have proven effective for long sequences, while EinFFT is a novel channel modeling technique. Experimental results show that SiMBA outperforms existing state-space models and Transformers on multiple benchmarks.

As an important part of Mamba, the scanning mechanism not only contributes to efficiency but also determines how spatial information is captured in vision-related scenarios. We summarize the scanning mechanisms used in existing works in Table 1. Cross-Scan and Bidirectional Scan are the most widely used, while various other scanning mechanisms serve specific purposes. For example, 3D Bidirectional Scan and Spatiotemporal Selective Scan are tailored to video input, Local Scan focuses on gathering local information, and ES2D prioritizes efficiency.


4.3 Mamba with Other Architectures

Combining Mamba with convolution gives it the ability to capture local information, which is essential for medical imaging and segmentation tasks. Res-VMamba introduces a residual learning framework within the VMamba model to take advantage of both the global and local state features inherent in the original VMamba architecture. The architecture begins with a stem module responsible for processing the input image, followed by a series of VSS blocks organized into four distinct stages. Unlike the original VMamba framework, Res-VMamba uses the VMamba structure as its backbone and incorporates the raw data directly into the feature map. The authors refer to this integration as the global residual mechanism to distinguish it from the residual connection inside a VSS block. It is designed to facilitate the sharing of information between the local details captured by individual VSS blocks and the overall global features of the unprocessed input, enhancing the representational capability of the model and improving performance on tasks that require a comprehensive understanding of visual data.

To take advantage of the long-sequence modeling capability of Mamba blocks and the spatiotemporal representation capability of LSTMs, the VMRNN cell removes all weights and biases of the ConvLSTM and uses VSS blocks to learn spatial dependencies along the vertical direction. Within the VMRNN cell, long- and short-term temporal dependencies are captured by updating the cell state and hidden state along the horizontal direction. On the basis of the VMRNN cell, two variants are proposed: VMRNN-B and VMRNN-D. VMRNN-B relies primarily on stacking VMRNN layers, while VMRNN-D contains more VMRNN cells and introduces Patch Merging and Patch Expanding layers. The Patch Merging layer is used for downsampling, effectively reducing the spatial dimension of the data, which helps reduce computational complexity and captures more abstract, global features. Conversely, the Patch Expanding layer is used for upsampling, restoring spatial detail and enabling precise localization of features during the reconstruction phase. Finally, a reconstruction layer receives the hidden state from the VMRNN layer and scales it back to the input size, generating the predicted frame for the next time step. Integrating downsampling and upsampling has important advantages in this predictive architecture: downsampling simplifies the input representation, enabling the model to process higher-level features with lower computational overhead, which is especially beneficial for a more abstract understanding of complex patterns and relationships within the data.
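The Patch Merging operation referred to above follows the familiar Swin-style downsampling pattern; a minimal sketch with hypothetical dimensions:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of a Patch Merging (downsampling) layer: each 2x2 patch
    neighborhood is concatenated channel-wise and linearly projected,
    halving spatial resolution while doubling the channel width."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C), H and W even
        tl, bl = x[:, 0::2, 0::2], x[:, 1::2, 0::2]
        tr, br = x[:, 0::2, 1::2], x[:, 1::2, 1::2]
        x = torch.cat([tl, bl, tr, br], dim=-1)        # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))            # (B, H/2, W/2, 2C)

out = PatchMerging(64)(torch.randn(1, 14, 14, 64))     # (1, 7, 7, 128)
```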

The SSM-ViT block is used to process event-based information efficiently. It consists of three main components: a block self-attention block (Block-SA), a grid self-attention block (Grid-SA), and an SSM block. Block-SA focuses on immediate spatial relationships and provides a detailed representation of nearby features. Grid-SA offers a global perspective, capturing comprehensive spatial relationships and the overall input structure. The SSM block ensures temporal consistency and the transfer of information between successive time steps. By combining SSM with self-attention, the SSM-ViT block enables faster training and parameterized timescale adjustment for temporal aggregation.

The Meet More Areas (MMA) block uses a MetaFormer-style architecture and includes two normalization layers, a token mixer consisting of a channel attention mechanism and a ViM block in parallel, and an MLP block for deep feature extraction. There are two main reasons for choosing this structure. First, models with MetaFormer-style architectures have shown promising results, indicating strong potential. Second, to make full use of the global information extracted by the ViM block, a channel attention mechanism is incorporated to activate more pixels, since global detail plays a role in determining the channel attention weights. In addition, it is reasonable to expect that convolution-based modules can enhance the visual representation obtained by ViM blocks and simplify training, similar to the benefits observed with Transformers. For restoration, the Residual State Space Block (RSSB) places a VSS block before the channel attention block, which allows the VSS block to focus on learning diverse channel representations while the subsequent channel attention selects the critical channels, thus avoiding channel redundancy.
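The channel attention used here typically follows a squeeze-and-excitation pattern; the sketch below is an illustrative stand-in over a token sequence, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention sketch: pool each channel
    globally, pass it through a small bottleneck MLP, and rescale the channels."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, L, D) token sequence
        weights = self.mlp(x.mean(dim=1, keepdim=True))    # (B, 1, D) channel weights
        return x * weights                                 # reweight channels globally

out = ChannelAttention(64)(torch.randn(2, 196, 64))        # (2, 196, 64)
```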


Mamba is gaining traction in the field of computer vision because of its ability to handle long-range dependencies and its significant computational efficiency relative to Transformers. As this survey details, various methods have been developed to harness and explore Mamba's capabilities, reflecting the rapid advancement of the field.

We begin with a discussion of the basic concepts of SSMs and the Mamba architecture, followed by a comprehensive analysis of competing approaches across a range of computer vision applications. Our survey covers the latest Mamba models designed for backbone architectures, high/mid-level vision, low-level vision, medical imaging, and remote sensing. This is the first review paper on the latest developments in SSM- and Mamba-based techniques with a clear focus on computer vision challenges. Our goal is to generate more interest in the vision community in exploiting the possibilities of the Mamba model and finding solutions to its current limitations.

Readers interested in more experimental results and details can refer to the original paper.

