

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add our assistant on WeChat (ID: dddvision) with a note of your research direction + school/company + nickname to be added to the group. The industry subdivisions are listed at the end of the article.

Title: Lightweight Deep Learning for Resource-Constrained Environments: A Survey

Authors: Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng

Institutions: National Yang Ming Chiao Tung University, Jilin University, Multimedia University, Hon Hai Research Institute, National Taiwan University

Original link: https://arxiv.org/abs/2404.07236

Over the past decade, deep learning has dominated various areas of AI, including natural language processing, computer vision, and biomedical signal processing. While model accuracy has improved significantly, deploying these models on lightweight devices such as mobile phones and microcontrollers is limited by scarce resources. This survey provides comprehensive design guidelines for such devices, detailing the careful design of lightweight models, compression methods, and hardware acceleration strategies. The main goal of this work is to explore methods and concepts that overcome hardware limitations without compromising model accuracy. The survey also explores two significant future avenues for lightweight deep learning: TinyML and deployment techniques for large language models. While these directions undoubtedly have potential, they also pose significant challenges that call for research into untapped areas.

The importance of neural networks (NNs) has risen dramatically in recent years, with applications permeating daily life and expanding to support increasingly complex tasks. However, since the release of AlexNet in 2012, the field has trended toward deeper, more complex networks in pursuit of accuracy. For example, Model Soups achieves remarkable accuracy on the ImageNet dataset, but at the cost of more than 1.843 billion parameters. Similarly, GPT-4 excels on natural language processing (NLP) benchmarks, albeit with a reported 1.76 trillion parameters. The computational demand of deep learning (DL) increased roughly 300,000-fold between 2012 and 2018. This dramatic growth in scale sets the stage for the challenges and developments explored in this article.

In response to these practical needs, a large body of work has emerged in recent years on lightweight model design, model compression, and acceleration techniques. The Mobile AI (MAI) workshop has been held annually at CVPR from 2021 to 2023, focusing on deploying DL models on resource-constrained hardware such as ARM Mali GPUs and on image processing on the Raspberry Pi 4. In addition, the Advances in Image Manipulation (AIM) workshops held at ICCV 2019, ICCV 2021, and ECCV 2022 were organized around the challenges of image/video processing, restoration, and enhancement on mobile devices.

In this work, the authors analyze the development of efficient lightweight models from the design phase to the deployment phase, incorporating three key elements into the pipeline: NN architecture design, compression methods, and hardware acceleration of lightweight DL models. Previous surveys have tended to focus on a single aspect of this pipeline, such as quantization methods alone; they provide detailed insights into that area but not a comprehensive understanding of the entire process, and they may overlook important alternative methods and techniques. In contrast, this review covers lightweight architectures, compression methods, and hardware acceleration algorithms together.

The authors examine classic lightweight architectures and classify them into families for greater clarity. Some of these architectures made significant progress by introducing innovative convolutional blocks; for example, depthwise separable convolutions deliver high accuracy at greatly reduced computational cost. It is important to note that parameter count and FLOPs do not always align with inference time. Early lightweight architectures, such as SqueezeNet and MobileNet, were designed to reduce parameters and FLOPs, but this reduction often increases memory access cost (MAC), resulting in slower inference. The authors' goal is therefore to promote the application of lightweight models by providing a more comprehensive and practical review.
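To make the building block concrete, here is a minimal PyTorch sketch of a depthwise separable convolution in the MobileNet style (a depthwise 3x3 conv followed by a pointwise 1x1 conv). The channel counts, normalization, and activation choices are illustrative assumptions, not the exact block from any specific paper.

```python
# Minimal sketch (assumed shapes/hyperparameters) of a depthwise separable conv block.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups = in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 conv that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# A standard 3x3 conv from 64 to 128 channels uses 3*3*64*128 = 73,728 weights;
# the separable version uses 3*3*64 + 64*128 = 8,768.
block = DepthwiseSeparableConv(64, 128)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 128, 56, 56])
```

Note that this weight reduction does not automatically translate into lower latency, which is exactly the MAC-versus-FLOPs caveat raised above.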


In addition to lightweight architecture design, the authors survey a variety of efficient algorithms for compressing a given architecture. Quantization methods reduce the storage required for data, typically by replacing 32-bit floating-point numbers with 16-bit or 8-bit representations, or even binary values. Pruning algorithms, in their simplest form, remove individual parameters to eliminate unnecessary redundancy within the network; more elaborate algorithms remove entire channels or filters. Knowledge distillation (KD) transfers knowledge from one model (the "teacher") to another (the "student"): the teacher is a large pre-trained model with the required knowledge, while the student is a smaller, untrained model that extracts knowledge from the teacher. As the method has evolved, some algorithms use the same network twice, eliminating the need for a separate teacher model. As these compression methods have matured, it has become common to combine two or more techniques, such as pruning and quantization, in the same model.
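Below is a hedged PyTorch sketch showing what these three ideas look like on a toy model: magnitude (L1) pruning, post-training dynamic quantization, and a standard temperature-scaled distillation loss. The model, sparsity ratio, and temperature are placeholder assumptions; real pipelines from the surveyed papers are more involved.

```python
# Toy illustration of pruning, quantization, and a KD loss (assumed settings).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Unstructured pruning: zero out the 30% smallest-magnitude weights.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# 2) Post-training dynamic quantization: store Linear weights as int8 instead of float32.
compressed = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 3) Knowledge distillation: soften teacher/student logits with temperature T
#    and penalize their KL divergence (logits here are dummies).
def kd_loss(student_logits, teacher_logits, T=4.0):
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)

loss = kd_loss(torch.randn(8, 10), torch.randn(8, 10))
```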


In addition, the authors discuss neural architecture search (NAS) algorithms, a family of techniques that automate model creation while reducing human intervention. These algorithms autonomously search a defined space for optimal design factors such as network depth and filter configurations. Research in this area focuses on how to define, traverse, and evaluate the search space so as to reach high accuracy without consuming excessive time and resources.
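As a purely illustrative sketch of the search loop described above, the toy code below samples depth/width combinations from a tiny search space and keeps the best candidate under a parameter budget. The budget, search space, and scoring function are assumptions; real NAS methods (differentiable, evolutionary, weight-sharing) replace the random placeholder score with trained-and-validated accuracy or a cheap proxy.

```python
# Toy NAS loop: sample architectures, filter by a parameter budget, keep the best score.
import random
import torch.nn as nn

def build(depth, width):
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU()]
        in_ch = width
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

def proxy_score(model):
    # Placeholder for "train briefly and measure validation accuracy".
    return random.random()

best, budget = None, 2_000_000  # parameter budget (assumption)
for _ in range(20):
    depth, width = random.choice([2, 4, 6]), random.choice([16, 32, 64])
    candidate = build(depth, width)
    if num_params(candidate) > budget:
        continue  # respect the resource constraint
    score = proxy_score(candidate)
    if best is None or score > best[0]:
        best = (score, depth, width)
print("best depth/width:", best[1:])
```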

The authors then survey popular hardware accelerators for DL applications, including graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and tensor processing units (TPUs). They describe different dataflow types and discuss data locality optimization methods in depth, exploring the techniques that enable efficient processing of DL workloads. They then cover popular DL libraries designed to accelerate the DL pipeline, a diverse range of tools and frameworks that play an important role in making full use of hardware accelerators. Finally, they examine hardware-software co-design solutions, which deliver well-optimized, end-to-end acceleration of DL but require careful joint consideration of hardware architecture and compression methods.


This review explores lightweight models, compression methods, and hardware acceleration, demonstrating their capabilities across a wide range of general-purpose applications. However, deploying these models in resource-constrained environments still presents significant challenges. This section highlights emerging techniques for accelerating and applying deep learning models in Tiny Machine Learning (TinyML) and LLMs, focusing on open questions that require further research.

TinyML is an emerging technology that enables deep learning algorithms to run on ultra-low-power IoT devices that consume less than 1 mW. However, the extremely constrained hardware environment makes designing and developing TinyML models challenging. MCUs dominate low-end IoT devices because they are more cost-effective than CPUs and GPUs, but MCU libraries (such as CMSIS-NN and TinyEngine) are typically platform-dependent, unlike GPU frameworks such as PyTorch and TensorFlow that offer cross-platform support. As a result, TinyML development focuses more on specialized applications than on general-purpose research, which can slow overall research progress.

MCU-based libraries. Owing to the resource-constrained environment of TinyML, MCU-based libraries are often designed for specific use cases. For example, CMSIS-NN, a pioneering MCU library developed for ARM Cortex-M devices, proposes efficient kernels divided into NNFunctions and NNSupportFunctions: NNFunctions execute the main operations in the network, such as convolution, pooling, and activation, while NNSupportFunctions contain data transformation and activation tables. CMix-NN provides an open-source mixed- and low-precision tool that can quantize model weights and activations to 8, 4, or 2 bits. MCUNet proposes a co-design framework for DL deployment on commercial MCUs: it integrates TinyNAS to efficiently search for the most accurate yet lightweight models and leverages TinyEngine, which includes code-generator-based compilation and in-place depthwise convolution, to address memory constraints. MCUNetV2 introduces a patch-based inference mechanism that operates only on small spatial regions of the feature map, further reducing peak memory usage. MicroNet employs differentiable NAS (DNAS) to search for efficient models with low operation counts and supports the open-source platform TensorFlow Lite Micro (TFLM). MicroNet achieves state-of-the-art results on all TinyMLPerf industry-standard benchmark tasks: visual wake words, Google Speech Commands, and anomaly detection.
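For context, the sketch below shows the usual TensorFlow Lite Micro deployment path as commonly practiced: full-integer (int8) post-training quantization of a small Keras model into a flatbuffer that an MCU runtime such as TFLM can execute. The model and calibration data are placeholders; the surveyed MCU libraries above each have their own, more specialized toolchains.

```python
# Hedged sketch of an int8 TFLite conversion for MCU deployment (placeholder model/data).
import numpy as np
import tensorflow as tf

keras_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
calibration_images = np.random.rand(100, 32, 32, 1).astype(np.float32)

def representative_dataset():
    # Calibration samples used to pick quantization ranges.
    for img in calibration_images:
        yield [img[None, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
# The .tflite file is then typically embedded as a C array (e.g. via xxd)
# in the microcontroller firmware build.
```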

What is holding back TinyML's growth? Despite this progress, TinyML's growth is constrained by several inherent limitations: resource constraints, hardware and software heterogeneity, and a lack of datasets. Extreme resource constraints, such as extremely small SRAM and flash memory of less than 1 MB, make designing and deploying TinyML models on edge devices challenging. In addition, owing to hardware heterogeneity and the lack of framework compatibility, current TinyML solutions must be adapted to each individual device, complicating the widespread deployment of TinyML algorithms. Moreover, existing datasets may not suit TinyML architectures, because the data may not match the characteristics of the data generated by the sensors on edge devices. A set of standard datasets suitable for training TinyML models is needed to drive the development of effective TinyML systems. These open research challenges must be addressed before TinyML can be deployed at scale on IoT and edge devices.

Building lightweight large language models. Over the past two years, LLMs have consistently excelled across a variety of tasks. They have significant practical potential, especially when paired with human supervision; for example, they can serve as co-pilots for autonomous agents or as a source of inspiration and advice. However, these models often have billions of parameters, and deploying them for inference typically requires GPU-class hardware and tens of gigabytes of memory, which poses a significant challenge for everyday LLM use. For example, Tao et al. found it difficult to quantize generative pre-trained language models because their word embeddings are homogeneous and their weight distributions vary. Transforming large, resource-intensive LLMs into compact versions suitable for deployment on resource-constrained mobile devices has therefore become a prominent direction for future research.

World-renowned companies have made significant progress in LLM deployment. In 2023, Qualcomm demonstrated the text-to-image model Stable Diffusion and the image-to-image model ControlNet running independently on mobile devices, accelerating the migration of large models to edge computing environments. Google has also unveiled several versions of its latest general-purpose large model, PaLM 2, including a lightweight variant tailored for mobile platforms. These advances create new opportunities for migrating large models from cloud-based systems to edge devices. However, some large models still require tens of gigabytes of physical storage and runtime memory, and efforts are underway to reach a memory footprint below 1 GB, which means there is still much work to be done. This section outlines key initiatives for simplifying LLM deployment in resource-constrained environments.

Pruning without retraining. Recently, substantial effort has gone into constructing lightweight LLMs using standard DL quantization and pruning techniques. Some methods focus on quantization, greatly reducing numerical precision. SparseGPT demonstrated for the first time that a large-scale generative pre-trained Transformer (GPT) model can be pruned to at least 50% sparsity in a single step, without any subsequent retraining and with minimal loss of accuracy. Subsequently, Wanda (Pruning by Weights and Activations) was designed specifically to induce sparsity in pre-trained LLMs: it prunes the weights with the smallest magnitudes scaled by the corresponding input activation norms, with no retraining or weight updates, so the pruned LLM can be used directly, increasing its practicality. Notably, Wanda surpasses the established magnitude-pruning baseline and competes effectively with recent methods that involve extensive weight updates. These efforts set important milestones for the future design of retraining-free LLM pruning methods.
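The snippet below is a simplified sketch of a Wanda-style criterion on a single linear layer: score each weight by its magnitude times the L2 norm of its input feature (measured on a small calibration batch) and zero the lowest-scoring half of each row. It illustrates the idea only; the layer, calibration data, and per-row 50% sparsity are assumptions, not the authors' implementation.

```python
# Simplified Wanda-style one-shot pruning of one Linear layer (placeholder layer/data).
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)          # placeholder layer
calib_x = torch.randn(128, 512)      # placeholder calibration inputs

with torch.no_grad():
    act_norm = calib_x.norm(p=2, dim=0)        # ||X_j||_2 per input feature
    score = layer.weight.abs() * act_norm      # |W_ij| * ||X_j||_2, broadcast over rows
    k = layer.weight.shape[1] // 2             # 50% sparsity within each output row
    prune_idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(layer.weight)
    mask.scatter_(1, prune_idx, 0.0)
    layer.weight.mul_(mask)                    # one-shot: no retraining, no weight updates
```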

Model design. From a model design perspective, lightweight LLMs can be created from the start with a focus on reducing the number of model parameters. A promising approach in this regard is prompt tuning, which aims to optimize LLM performance while maintaining efficiency and model size. A notable example is Visual Prompt Tuning (VPT), an efficient and effective alternative to full fine-tuning of large-scale Transformer models on vision tasks. VPT introduces only a small fraction of trainable parameters in the input space, less than 1%, while keeping the model backbone intact. Another notable contribution is CALIP, which introduces a parameter-free attention mechanism to facilitate effective interaction and communication between visual and textual features; it generates text-aware image features and visually guided text features, contributing to more concise and efficient vision-language models. Going forward, a promising avenue for lightweight LLM design is to develop adaptive fine-tuning strategies that dynamically adjust the model's architecture and parameters to specific task requirements, ensuring the model can optimize its performance for a given application without unnecessary parameter bloat.
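To make the prompt-tuning idea concrete, here is a minimal sketch: freeze a Transformer backbone and train only a handful of prompt tokens prepended to the input sequence (plus a small head). The backbone here is a generic nn.TransformerEncoder stand-in with assumed dimensions, not the ViT used by VPT; the point is only that the trainable fraction ends up well below 1%.

```python
# Minimal prompt-tuning sketch: frozen backbone, learnable prompt tokens + head.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, dim=192, num_prompts=10, num_classes=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        for p in self.backbone.parameters():     # freeze the "pretrained" weights
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))  # trainable
        self.head = nn.Linear(dim, num_classes)                        # trainable

    def forward(self, patch_tokens):             # (B, N, dim) patch embeddings
        B = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        x = self.backbone(x)
        return self.head(x.mean(dim=1))

model = PromptTunedEncoder()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # well under 1% here
```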

Building lightweight diffusion models. In recent years, generative models based on denoising diffusion, especially score-based models, have made significant progress in creating diverse and realistic data. However, moving the inference phase of a diffusion model to edge devices presents significant challenges. Inference reverses the forward noising process to generate real data from Gaussian noise, a procedure commonly referred to as denoising. Moreover, when these models are compressed to reduce their footprint and computational requirements, image quality can degrade severely: compression may require simplifying, approximating, or even removing model components that are necessary for accurately reconstructing data from Gaussian noise. Developing diffusion models for resource-constrained scenarios therefore hinges on a critical trade-off between reducing model size and maintaining high-quality image generation.
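For readers unfamiliar with the denoising loop mentioned above, here is a hedged sketch of a generic DDPM-style reverse process: start from Gaussian noise and repeatedly subtract the network's noise prediction. The noise-prediction network `eps_model`, its call signature, and the linear beta schedule are placeholder assumptions; the key cost driver is that every one of the T steps runs a full forward pass, which is what makes on-device inference so expensive.

```python
# Generic DDPM-style reverse (denoising) loop; eps_model is a placeholder network.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape=(1, 3, 32, 32)):
    x = torch.randn(shape)                              # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]))           # one full network forward per step
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])    # posterior mean estimate
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # add sampling noise
    return x
```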

Deploying Vision Transformers (ViTs). Despite the increasing prevalence of lightweight ViTs, deploying ViTs in hardware-constrained environments remains an open concern. The latency and energy consumption of ViT inference on mobile devices have been reported to be 40 times higher than those of CNN models, so mobile devices cannot support ViT inference without modification. The self-attention operation in ViTs computes pairwise relationships between image patches, and its cost grows quadratically with the number of patches. In addition, the FFN layers take longer to compute than the attention layers. By removing redundant attention heads and FFN layers, DeiT-Tiny can reduce latency by 23.2% with a negligible accuracy loss of 0.75%.
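A back-of-the-envelope sketch of the scaling argument: per Transformer layer, the QK^T and attention-times-V products cost on the order of N^2·d multiply-accumulates, while the projections and the FFN cost on the order of N·d^2 (assuming the usual FFN hidden size of 4d). The d = 192 value matches DeiT-Tiny's embedding dimension (an assumption for illustration), and the patch counts correspond to 224x224 versus 448x448 inputs with 16x16 patches plus a class token.

```python
# Rough MAC counts per layer for ViT attention vs. FFN (standard Transformer shapes assumed).
def attention_macs(N, d):
    return 4 * N * d * d + 2 * N * N * d   # Q/K/V/out projections + two NxN matmuls

def ffn_macs(N, d):
    return 8 * N * d * d                   # two linear layers with hidden size 4d

d = 192                                     # DeiT-Tiny embedding dim (assumed)
for N in (197, 785):                        # 224^2 vs 448^2 input, patch size 16, + cls token
    print(N, round(attention_macs(N, d) / 1e6, 1), round(ffn_macs(N, d) / 1e6, 1))
# Quadrupling the pixel count roughly quadruples the FFN cost but grows the
# N^2 attention term about 16x, which is the quadratic blow-up described above.
```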

Some works have deployed Transformer-based NLP models on embedded systems such as FPGAs. Recently, DiVIT and VAQF proposed hardware-software co-design solutions for ViTs. DiVIT proposes an incremental patch encoding that exploits differential attention between neighboring patches at the algorithm level; it designs an array of differential attention processing engines with differential dataflow communication using bit-saving techniques, and performs exponential operations with lookup tables, requiring no additional computation and minimal hardware overhead. VAQF introduces binarization into ViTs for the first time, enabling FPGA mapping and quantization training. Specifically, given a target frame rate, VAQF generates the required quantization precision and accelerator description for direct software and hardware implementation.

To deploy ViTs seamlessly on resource-constrained devices, the authors highlight two potential future directions:

1) Algorithm optimization. Beyond designing efficient ViT models as described above, the bottlenecks of ViTs should also be addressed. For example, MatMul operations can be accelerated or reduced, since they are a bottleneck in ViTs. Improvements in integer quantization and operator fusion are also worth considering.

2) Hardware accessibility. Unlike CNNs, which are supported by most mobile devices and AI accelerators, ViTs lack dedicated hardware support; for example, ViTs do not run on mobile GPUs or the Intel NCS2 VPU. According to the authors' findings, some important operators are not supported on specific hardware. Specifically, on mobile GPUs, the concatenation operator in the TFLite GPU delegate requires a 4-dimensional input tensor, whereas ViTs use 3-dimensional tensors. The Intel VPU, on the other hand, does not support LayerNorm, which is present in Transformer architectures but uncommon in CNNs. Further investigation of hardware support for ViTs on resource-constrained devices is therefore required.

The purpose of this article is to describe simply but accurately how lightweight architectures, compression methods, and hardware techniques can be leveraged to achieve accurate models on resource-constrained devices. The main contributions are summarized below:

(1) Previous surveys have only briefly mentioned a small number of lightweight works. The authors organize lightweight architectures into families, such as grouping MobileNetV1-V3 and MobileNeXt into the MobileNet family, and trace the history of lightweight architectures from their inception to the present.

(2) To cover the full lightweight DL pipeline, this article also covers compression and hardware acceleration methods. Unlike many surveys that do not explicitly establish the links between these technologies, it provides a comprehensive overview of each area and insight into how they relate to one another.

(3) As part of the frontier advances in lightweight DL, the authors review current challenges and explore future work. First, TinyML is explored, an emerging approach designed for deploying DL models on devices with extremely limited resources. Subsequently, various contemporary initiatives to leverage LLMs on edge devices are investigated, which is a promising direction in the field of lightweight DL.

Recently, computer vision applications have increasingly focused on energy savings, carbon footprint reduction, and cost-effectiveness, highlighting the importance of lightweight models for AI at the edge. This article provides a comprehensive examination of lightweight deep learning (DL), exploring important models such as MobileNet and efficient Transformer variants, as well as popular strategies for optimizing them, including pruning, quantization, knowledge distillation, and neural architecture search. In addition to explaining these methods in detail, it provides practical guidance for customizing lightweight models, analyzing their respective strengths and weaknesses. It also discusses hardware acceleration of DL models, covering hardware architectures, different dataflow types, and data locality optimization techniques to deepen understanding of accelerated training and inference. The survey further reveals the interaction between hardware and software (co-design), offering insights into accelerating training and inference from a hardware perspective. Finally, the authors look to the future, recognizing the challenges of deploying lightweight DL models in TinyML and LLM settings and the creative solutions these evolving areas still require.

Readers interested in more experimental results and details can refer to the original paper.

This article is for academic sharing only. If there is any infringement, please contact us to delete the article.

3D Vision Workshop Exchange Group

At present, we have established multiple communities around 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also exchange groups for job searching, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add our assistant on WeChat (ID: dddvision) with a note of your research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry) to be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grabbing, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Linear and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.
