MyDLNote-High-Resolution: gOctConv: Highly Efficient Salient Object Detection with 100K Parameters
Highly Efficient Salient Object Detection with 100K Parameters
[ECCV 2020] [Code]
Abstract
Salient object detection models often demand a considerable amount of computation cost to make precise prediction for each pixel, making them hardly applicable on low-power devices.
In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree.
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features, while reducing the representation redundancy by a novel dynamic weight decay scheme. The dynamic weight decay scheme stably boosts the sparsity of parameters during training and supports a learnable number of channels for each scale in gOctConv, allowing an 80% reduction in parameters with negligible performance drop. Utilizing gOctConv, we build an extremely light-weighted model, namely CSNet, which achieves comparable performance with only ∼ 0.2% of the parameters (100k) of large models on popular salient object detection benchmarks.
Salient object detection (SOD) is an important computer vision task with various applications in image retrieval [17,5], visual tracking [23], photographic composition [15], image quality assessment [69], and weakly supervised semantic segmentation [25]. While convolutional neural networks (CNNs) based SOD methods have made significant progress, most of these methods focus on improving the state-of-the-art (SOTA) performance, by utilizing both fine details and global semantics [64,80,83,76,11], attention [3,2], as well as edge cues [12,68,85,61] etc. Despite the great performance, these models are usually resource-hungry, which are hardly applicable on low-power devices with limited storage/computational capability. How to build an extremely light-weighted SOD model with SOTA performance is an important but less explored area.
The SOD task requires generating accurate prediction scores for every image pixel, and thus requires both large-scale high-level feature representations for correctly locating salient objects and finely detailed low-level representations for precise boundary refinement [12,67,24]. There are two major challenges towards building an extremely light-weighted SOD model. Firstly, serious redundancy can appear when the low-frequency nature of high-level features meets the high output resolution of saliency maps. Secondly, SOTA SOD models [44,72,12,84,46,10] usually rely on ImageNet pre-trained backbone architectures [19,13] to extract features, which are by themselves resource-hungry.
Very recently, the spatial redundancy issue of low-frequency features has also been noticed by Chen et al. [4] in the context of image and video classification. As a replacement for vanilla convolution, they design an OctConv operation that processes slowly-varying feature maps at a lower spatial resolution to reduce computational cost. However, directly using OctConv [4] to reduce redundancy in the SOD task still faces two major challenges. 1) Utilizing only two scales, i.e., low and high resolutions as in OctConv, is insufficient to fully reduce redundancy in the SOD task, which needs much stronger multi-scale representation ability than classification tasks. 2) The number of channels for each scale in OctConv is manually selected, requiring much effort to re-adjust for a saliency model, since the SOD task requires less category information.
In this paper, we propose a generalized OctConv (gOctConv) for building an extremely light-weighted SOD model, extending OctConv in the following aspects: 1) The flexibility to take inputs from an arbitrary number of scales, from both in-stage and cross-stages features, allows a much larger range of multi-scale representations. 2) We propose a dynamic weight decay scheme to support a learnable number of channels for each scale, allowing an 80% reduction in parameters with negligible performance drop.
Benefiting from the flexibility and efficiency of gOctConv, we propose a highly light-weighted model, namely CSNet, which fully explores both in-stage and cross-stages multi-scale features. As a bonus of the extremely low number of parameters, our CSNet can be trained directly from scratch without ImageNet pre-training, avoiding the feature representations that are only needed for distinguishing between categories in the recognition task.
In summary, we make two major contributions in this paper:
– We propose a flexible convolutional module, namely gOctConv, to efficiently utilize both in-stage and cross-stages multi-scale features for SOD task, while reducing the representation redundancy by a novel dynamic weight decay scheme.
– Utilizing gOctConv, we build an extremely light-weighted SOD model, namely CSNet, which achieves comparable performance with ∼ 0.2% parameters (100k) of SOTA large models on popular SOD benchmarks.
Originally designed as a replacement for the traditional convolution unit, the vanilla OctConv [4] shown in Fig. 2 (a) conducts the convolution operation across high/low scales within a stage. However, only two scales within a stage cannot introduce enough multi-scale information for the SOD task. Moreover, the channels for each scale in vanilla OctConv are manually set, requiring much effort to re-adjust for a saliency model, since the SOD task requires less category information. Therefore, we propose a generalized OctConv (gOctConv) that allows an arbitrary number of input resolutions from both in-stage and cross-stages conv features, with a learnable number of channels, as shown in Fig. 2 (b). As a generalized version of vanilla OctConv, gOctConv improves on it for the SOD task in the following aspects: 1) Arbitrary numbers of input and output scales are supported, enabling a larger range of multi-scale representations. 2) Besides in-stage features, gOctConv can also process cross-stages features with arbitrary scales from the feature extractor. 3) gOctConv supports learnable channels for each scale through our proposed dynamic weight decay assisted pruning scheme. 4) Cross-scale feature interaction can be turned off to support large flexibility in complexity. The flexible gOctConv allows many instances under different design requirements. We give a detailed introduction of the different instances of gOctConv in the following light-weighted model design.
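To make the shape bookkeeping concrete, here is a toy numpy sketch of a 1 × 1 gOctConv, not the paper's implementation: every output scale sums 1 × 1-convolved contributions from every input scale, resized to the output resolution, and setting a weight matrix to `None` turns that cross-scale interaction off (aspect 4 above). The function names and the nearest-neighbor resizing are illustrative assumptions.

```python
import numpy as np

def resize_nn(x, size):
    """Nearest-neighbor resize of a (C, H, W) feature map to (C, size, size)."""
    c, h, w = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[:, rows][:, :, cols]

def goctconv_1x1(inputs, weights, out_sizes):
    """Toy 1x1 gOctConv with arbitrary numbers of input/output scales.

    inputs:    list of (C_in_s, H_s, W_s) arrays, one per input scale
               (in-stage or cross-stages features).
    weights:   weights[o][s] is a (C_out_o, C_in_s) matrix, i.e. the 1x1
               conv from input scale s to output scale o; None disables
               that cross-scale interaction.
    out_sizes: spatial size of each output scale.
    """
    outputs = []
    for o, size in enumerate(out_sizes):
        acc = None
        for s, x in enumerate(inputs):
            w = weights[o][s]
            if w is None:                         # interaction turned off
                continue
            y = np.einsum('oc,chw->ohw', w, x)    # 1x1 convolution
            y = resize_nn(y, size)                # bring to output scale
            acc = y if acc is None else acc + y
        outputs.append(acc)
    return outputs
```

For example, fusing a high-resolution and a low-resolution input into two output scales only requires listing the corresponding weight matrices; the learnable-channel scheme described later decides how many rows each matrix keeps.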
1. gOctConvs: Light-weighted Model Composed of gOctConvs
2. Dynamic weight decay scheme: Learnable Channels for gOctConv
Light-weighted Model Composed of gOctConvs
Overview.
As shown in Fig. 3, our proposed light-weighted network, consisting of a feature extractor and a cross-stages fusion part, synchronously processes features at multiple scales. The feature extractor is stacked with our proposed in-layer multi-scale blocks, namely ILBlocks, and is split into 4 stages according to the resolution of the feature maps, where the stages have 3, 4, 6, and 4 ILBlocks, respectively. The cross-stages fusion part, composed of gOctConvs, processes features from the stages of the feature extractor to produce a high-resolution output.
ILBlock enhances the multi-scale representation of features within a stage. gOctConvs are utilized to introduce multi-scale features within the ILBlock. Vanilla OctConv requires about 60% of the FLOPs [4] to achieve performance similar to standard convolution, which is not enough for our objective of designing a highly light-weighted model. To save computational cost, interacting features across different scales in every layer is unnecessary. Therefore, we apply an instance of gOctConv in which each input channel corresponds to an output channel with the same resolution, obtained by eliminating the cross-scale operations. A depthwise operation within each scale is utilized to further save computational cost. This instance of gOctConv only requires about 1/channels of the FLOPs of vanilla OctConv.

ILBlock is composed of a vanilla OctConv and two 3 × 3 gOctConvs, as shown in Fig. 3. The vanilla OctConv interacts features between two scales, and the gOctConvs extract features within each scale. Multi-scale features within a block are processed separately and interacted alternately. Each convolution is followed by BatchNorm [30] and PReLU [18]. Initially, we roughly double the channels of ILBlocks as the resolution decreases, except for the last two stages, which have the same channel number. Unless otherwise stated, the channels for the different scales in ILBlocks are set evenly. Learnable channels of the OctConvs are then obtained through the scheme described in Sec. 3.3.
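The depthwise per-scale instance used inside ILBlocks can be sketched in a few lines of numpy; this is an illustrative version under assumed names and zero padding, not the paper's code. Each scale's channels are filtered independently by a 3 × 3 depthwise kernel, and no cross-scale interaction takes place:

```python
import numpy as np

def depthwise_goctconv(inputs, kernels):
    """gOctConv instance for ILBlocks: no cross-scale interaction; each
    input channel maps to one output channel at the same resolution via
    a 3x3 depthwise convolution (zero padding, stride 1).

    inputs:  list of (C, H, W) feature maps, one per scale.
    kernels: list of matching (C, 3, 3) depthwise kernels.
    """
    outs = []
    for x, k in zip(inputs, kernels):
        c, h, w = x.shape
        pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero-pad spatially
        y = np.zeros_like(x)
        for i in range(3):                           # accumulate the 3x3 taps
            for j in range(3):
                y += k[:, i, j][:, None, None] * pad[:, i:i + h, j:j + w]
        outs.append(y)
    return outs
```

Because each channel uses a single 3 × 3 kernel instead of mixing all input channels, the parameter and FLOP counts drop by roughly the channel count relative to a dense per-scale convolution, which is the saving the paragraph above refers to.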
Fig. 3. Illustration of our salient object detection pipeline, which uses gOctConv to extract both in-stage and cross-stages multi-scale features in a highly efficient way.
To retain a high output resolution, common methods keep a high feature resolution at the high levels of the feature extractor and construct complex multi-level aggregation modules, inevitably increasing computational redundancy. While the value of multi-level aggregation is widely recognized [16,43], how to achieve it efficiently and concisely remains challenging. Instead, we simply use gOctConvs to fuse multi-scale features from the stages of the feature extractor and generate the high-resolution output. As a trade-off between efficiency and performance, features from the last three stages are used. A 1 × 1 gOctConv takes features with different scales from the last conv of each stage as input and conducts a cross-stages convolution to output features with different scales. To extract multi-scale features at a granular level, each scale of features is processed by a group of parallel convolutions with different dilation rates. Features are then sent to another 1 × 1 gOctConv to generate features at the highest resolution. A final standard 1 × 1 conv outputs the predicted saliency map. Learnable channels of the gOctConvs in this part are also obtained.
Learnable Channels for gOctConv

We propose to obtain a learnable number of channels for each scale in gOctConv by utilizing our proposed dynamic weight decay assisted pruning during training. Dynamic weight decay maintains a stable weight distribution among channels while introducing sparsity, helping pruning algorithms eliminate redundant channels with negligible performance drop.
The commonly used regularization trick weight decay [33,77] endows CNNs with better generalization performance. Mehta et al. [53] show that weight decay introduces sparsity into CNNs, which helps to prune unimportant weights: training with weight decay drives unimportant weights toward values close to zero. Thus, weight decay has been widely used in pruning algorithms to introduce sparsity [38,50,48,22,47,21]. The common implementation of weight decay adds an L2 regularization term to the loss function, which can be written as follows:
$$\mathcal{L} = \mathcal{L}_0 + \lambda \sum_i \frac{1}{2}\|w_i\|_2^2,$$
where L0 is the loss for the specific task, wi is the weight of the i-th layer, and λ is the weight for weight decay. During back-propagation, the weight wi is updated as
$$w_i \leftarrow w_i - \eta\left(\nabla f_i(w_i) + \lambda w_i\right),$$
where ∇fi (wi) is the gradient to be updated, and λwi is the decay term, which is associated only with the weight itself. Applying a large decay term enhances sparsity, but meanwhile inevitably enlarges the diversity of weights among channels. Fig. 4 (a) shows that diverse weights cause an unstable distribution of outputs among channels. Ruan et al. [8] reveal that channels with diverse outputs are more likely to contain noise, leading to biased representations for subsequent filters. Attention mechanisms have been widely used to re-calibrate such diverse outputs, at the price of extra blocks and computational cost [29,8]. We instead propose to relieve the diversity of outputs among channels with no extra cost during inference. We argue that the diverse outputs are mainly caused by the indiscriminate suppression that the decay term applies to all weights. Therefore, we propose to adjust the weight decay based on the specific features of each channel: during back-propagation, the decay term is dynamically changed according to those features. The weight update of the proposed dynamic weight decay is written as
$$w_i \leftarrow w_i - \eta\left(\nabla f_i(w_i) + \lambda_d \, S(x_i) \, w_i\right),$$
where λd is the weight of dynamic weight decay, xi denotes the features calculated by wi, and S (xi) is the metric of the feature, which can have multiple definitions depending on the task. In this paper, our goal is to stabilize the weight distribution among channels according to features. Thus, we simply use the global average pooling (GAP) [42] as the metric for a certain channel:
$$S(x_i) = \mathrm{GAP}(x_i) = \frac{1}{H \times W}\sum_{h=1}^{H}\sum_{w=1}^{W} x_i(h, w),$$
where H and W are the height and width of the feature map xi. Dynamic weight decay with the GAP metric ensures that weights producing large-valued features are suppressed, giving compact and stable weight and output distributions, as revealed in Fig. 4. The metric can also be defined in other forms to suit particular tasks, which we will study in future work. Please refer to Sec. 4.3 for a more detailed interpretation of dynamic weight decay.
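The two update rules can be contrasted in a few lines of numpy. This is a hedged sketch with made-up hyperparameter values, not the paper's training code: standard decay shrinks every channel by λwi regardless of its output, while the dynamic variant scales the decay term by S(xi), the GAP of the channel's features, so channels producing large activations are suppressed more strongly.

```python
import numpy as np

def update_weights(w, grad, x, lr=0.1, lam=1e-3, dynamic=True):
    """One SGD step with (dynamic) weight decay, per output channel.

    w:    (C,) per-channel weights.
    grad: (C,) task gradients, i.e. grad f_i(w_i).
    x:    (C, H, W) features produced by these channels.
    """
    if dynamic:
        s = x.mean(axis=(1, 2))   # S(x_i): global average pooling per channel
        decay = lam * s * w       # lambda_d * S(x_i) * w_i
    else:
        decay = lam * w           # plain lambda * w_i
    return w - lr * (grad + decay)
```

Running both variants on two channels with identical weights but different activation magnitudes shows the intended effect: standard decay moves them identically, while dynamic decay pulls the high-activation channel down faster, keeping the output distribution across channels more balanced.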
Note: compared with standard weight decay, an extra factor S(xi) appears. S(xi) is a metric of the feature and can be defined in multiple ways depending on the task. The goal here is to stabilize the weight distribution across channels according to their features, so global average pooling (GAP) [42] is used as the per-channel metric, as in the last formula above. Dynamic weight decay with the GAP metric ensures that weights producing large-valued features are suppressed, giving the compact and stable weight and output distributions shown in Fig. 4.
Learnable channels with model compression.
Now, we incorporate dynamic weight decay into pruning algorithms to remove redundant weights and thereby obtain learnable channels for each scale in gOctConvs. In this paper, we follow [48] and use the weight of the BatchNorm layer as the indicator of channel importance. The BatchNorm operation [30] is written as follows:
$$y = \gamma \cdot \frac{x - E(x)}{\sqrt{\mathrm{Var}(x) + \epsilon}} + \beta,$$
where x and y are the input and output features, E(x) and Var(x) are the mean and variance, respectively, ε is a small factor to avoid zero variance, and γ and β are learned factors. We apply dynamic weight decay to γ during training. Fig. 4 (b) reveals a clear gap between important and redundant weights, with unimportant weights suppressed to nearly zero (wi < 1e−20). Thus, we can easily remove channels whose γ is less than a small threshold, which yields the learnable channels for each resolution of features in gOctConv. The algorithm for obtaining the learnable channels of gOctConvs is illustrated in Alg. 1.
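The channel-selection step itself reduces to thresholding the BatchNorm γ values. A minimal sketch, with the threshold value assumed for illustration:

```python
import numpy as np

def prune_channels(gamma, threshold=1e-3):
    """Return indices of channels whose BatchNorm scale |gamma| exceeds
    the threshold; the remaining channels are treated as redundant and
    removed from the layer."""
    gamma = np.asarray(gamma)
    return np.flatnonzero(np.abs(gamma) > threshold)
```

Applying this per scale gives the learned channel count for each resolution in a gOctConv, e.g. `{scale: len(prune_channels(g)) for scale, g in gammas.items()}` for a hypothetical dict `gammas` of per-scale γ vectors.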