【ShuffleNet V2】《ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design》

ECCV-2018

Caffe version of the code: https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe/blob/master/shufflenet_v2_x1.0.prototxt

Caffe network visualization tool: http://ethereon.github.io/netscope/#/editor

Table of Contents

  • 1 Background and Motivation
  • 2 Advantages / Contributions
  • 3 Innovations
  • 4 Method
    • 4.1 Practical Guidelines for Efficient Network Design
      • 4.1.1 G1: Equal channel width minimizes memory access cost (MAC)
      • 4.1.2 G2: Excessive group convolution increases MAC
      • 4.1.3 G3: Network fragmentation reduces degree of parallelism
      • 4.1.4 G4: Element-wise operations are non-negligible
    • 4.2 ShuffleNet V2
  • 5 Experiments
    • 5.1 Datasets
    • 5.2 Speed and Accuracy
  • 6 Conclusion
  • 7 Supplement

1 Background and Motivation

Current network architecture design is mostly guided by an indirect metric, FLOPs (floating-point operations, here meaning the number of multiply-add operations). This neglects factors that affect the direct metric (speed or latency), such as memory access cost and platform characteristics (e.g. GPU, ARM), so the resulting architectures may be sub-optimal.

Evidence: MobileNet v2 is much faster than NASNet-A even though their FLOPs are comparable. The comparison figure in the paper shows this clearly: at the same FLOPs (fixing the x-coordinate), different models run at different speeds.

Guided instead by the direct metric (speed or latency), the authors design a new architecture called ShuffleNet V2.

As the same comparison figure shows, ShuffleNet V2 achieves the best accuracy/speed trade-off among the compared models.

Differences between the indirect metric and the direct metric:

1) Several important factors that have a considerable effect on speed are not taken into account by FLOPs:

  • memory access cost (MAC): group convolution incurs a large MAC
  • degree of parallelism: at the same FLOPs, a model with high parallelism runs faster than one with low parallelism

2) Operations with the same FLOPs can have different running times, depending on the platform (GPU, ARM).

  • e.g. tensor decomposition reduces FLOPs by 75% yet runs slower, because the cuDNN version in use has dedicated optimizations for particular operations (such as 3×3 convolution)

2 Advantages / Contributions

  • four practical guidelines for efficient architecture design
  • ShuffleNet V2, whose accuracy at the same FLOPs is well ahead of other models, and whose inference speed is also outstanding (second only to MobileNet v1)

3 Innovations

  • from the indirect metric (FLOPs) to the direct metric (speed)
  • following the direct metric, four practical guidelines for architecture design
  • ShuffleNet V2 itself

4 Method

For details of ShuffleNet V1, see 【ShuffleNet】《ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices》.

4.1 Practical Guidelines for Efficient Network Design

The FLOPs metric only accounts for the convolution part; other costs are listed below:

  • data I/O
  • data shuffle
  • element-wise operations (add, ReLU)

4.1.1 G1: Equal channel width minimizes memory access cost (MAC)

In depthwise separable convolutions, most of the computation is concentrated in the 1×1 convolution.

  • FLOPs: $B = hwc_1c_2$ (input feature map $h \times w \times c_1$, output $h \times w \times c_2$)
  • MAC: $hw(c_1+c_2) + c_1c_2$ (input/output feature maps plus kernel weights)

By the mean value inequality $a+b \geq 2\sqrt{ab}$, it follows that

$$MAC \geq 2\sqrt{hwB} + \frac{B}{hw}$$

This form is not very intuitive; substituting $B$ and $MAC$ back makes it obvious:

$$hw(c_1+c_2) + c_1c_2 \geq 2hw\sqrt{c_1c_2} + c_1c_2$$

Equality holds when $a = b$, i.e. $c_1 = c_2$, so MAC is minimized when the number of input channels equals the number of output channels.

The authors verify this experimentally: on both GPU and ARM, with the input shape and FLOPs fixed, the network runs fastest when $c_1 = c_2$.
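A minimal numeric check of G1 (my own hypothetical sizes, not the paper's benchmark settings): fix the FLOPs of a 1×1 convolution and compare the MAC of a balanced channel configuration against an unbalanced one.

```python
# Minimal numeric check of G1 (hypothetical sizes, not the paper's benchmark).
# For a 1x1 conv: FLOPs B = h*w*c1*c2, MAC = h*w*(c1 + c2) + c1*c2.

def mac_1x1(h, w, c1, c2):
    """Memory access cost of a 1x1 conv: input + output feature maps + weights."""
    return h * w * (c1 + c2) + c1 * c2

h, w = 56, 56
# Two configurations with identical FLOPs, since 128*128 == 64*256:
balanced   = mac_1x1(h, w, 128, 128)   # c1 == c2
unbalanced = mac_1x1(h, w, 64, 256)    # c1 != c2, same B
print(balanced, unbalanced)            # 819200 < 1019904: balanced wins, as G1 predicts
```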

4.1.2 G2: Excessive group convolution increases MAC

The MAC and FLOPs $B$ of a $1 \times 1$ group convolution are

$$MAC = hw(c_1+c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw}$$

where

$$B = \frac{hwc_1c_2}{g}$$

With the input shape fixed and $B$ held constant, a larger $g$ leads to a larger MAC.

Note one detail: with more groups, the same FLOPs corresponds to more output channels, because group convolution cuts FLOPs by a factor of $g$.

The experiments show that the larger $g$ is, the slower the network runs, because MAC increases under the same input and FLOPs.
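A small sketch of the G2 relation (hypothetical sizes): hold the input shape and the FLOPs $B$ of a 1×1 group convolution fixed, let the output channels grow with $g$ accordingly, and watch MAC increase with the number of groups.

```python
# G2 sketch (hypothetical sizes): fix the input (h, w, c1) and the FLOPs B of a
# 1x1 group conv, then vary the group count g.  Since B = h*w*c1*c2/g, at fixed B
# the output channels scale as c2 = B*g/(h*w*c1), and
# MAC = h*w*(c1 + c2) + c1*c2/g = h*w*c1 + B*g/c1 + B/(h*w).

h, w, c1 = 28, 28, 256
B = h * w * c1 * 256                      # FLOPs of the g = 1 baseline (c2 = 256)

for g in (1, 2, 4, 8):
    c2 = B * g // (h * w * c1)            # output channels at this g, same B
    mac = h * w * (c1 + c2) + c1 * c2 // g
    print(f"g={g}: c2={c2}, MAC={mac}")   # MAC grows monotonically with g
```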

4.1.3 G3: Network fragmentation reduces degree of parallelism

Some networks such as Inception, as well as NASNet-A produced by AutoML, tend to adopt "multi-path" structures, i.e. one block contains many different convolution or pooling operations. (See the post ShuffleNetV2:轻量级CNN网络中的桂冠.)

Fragmentation refers to the small convolutions or pooling operations on each of several branches (e.g. one large convolution split into small convolutions carried out separately on each branch). Although such fragmented structures can improve accuracy, they reduce efficiency under high parallelism and introduce extra overhead (kernel launches, synchronization, etc.). (See the post ShufflenetV2_高效网络的4条实用准则.)

  • NASNet-A: 13 fragmented operators per building block
  • ResNet: 2 or 3
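A toy structural illustration of fragmentation (my own modules, not the paper's appendix benchmark blocks): one large 3×3 convolution versus the same FLOPs split across four parallel small convolutions; the fragmented variant launches several small kernels and synchronizes at the concat, which is what hurts GPU parallelism.

```python
import torch
import torch.nn as nn

class OneBigConv(nn.Module):
    """A single 3x3 conv over all channels: one large, GPU-friendly kernel."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

class FragmentedConv(nn.Module):
    """Same total FLOPs, but split into 4 parallel small convs (c -> c/4 each):
    more kernel launches plus a synchronizing concat -- the 'fragmentation' of G3."""
    def __init__(self, c, branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c // branches, 3, padding=1) for _ in range(branches)])

    def forward(self, x):
        return torch.cat([m(x) for m in self.branches], dim=1)

# Both map (1, 256, 28, 28) -> (1, 256, 28, 28) with identical FLOPs,
# yet the fragmented one is typically slower on GPU.
x = torch.randn(1, 256, 28, 28)
print(OneBigConv(256)(x).shape, FragmentedConv(256)(x).shape)
```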

4.1.4 G4: Element-wise operations are non-negligible

Element-wise operations:

  • ReLU
  • AddTensor
  • AddBias
  • depthwise convolution (the authors also count it here because of its high MAC/FLOPs ratio)

These operations have small FLOPs but relatively heavy MAC.

In the corresponding table, reference [5] is ResNet.
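A back-of-envelope view (hypothetical tensor size) of why element-wise ops have "small FLOPs but relatively heavy MAC": an element-wise add needs roughly three memory accesses per FLOP, while a 3×3 convolution reuses each loaded value many times.

```python
# Rough MAC/FLOPs comparison (hypothetical sizes): element-wise add vs. 3x3 conv.
h, w, c = 56, 56, 128
n = h * w * c                           # elements per feature map

# Element-wise add of two feature maps: read two tensors, write one.
add_flops = n
add_mac   = 3 * n                       # ~3 memory accesses per FLOP

# 3x3 convolution, c -> c channels (ignoring padding effects).
conv_flops = h * w * c * c * 9
conv_mac   = 2 * n + c * c * 9          # in/out feature maps + kernel weights

print(add_mac / add_flops)              # 3.0
print(conv_mac / conv_flops)            # ~0.002: far more compute per byte moved
```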

4.2 ShuffleNet V2

In ShuffleNet V1:

  • bottleneck-like structure (violates G1)
  • point-wise group convolution (violates G2)
  • too many groups (violates G3)
  • shortcut connection with element-wise add (violates G4)

ShuffleNet V2 introduces a channel split: the $c$ input channels are split into $c - c'$ and $c'$ channels, forming two branches (or, put differently, two groups); the authors set $c' = \frac{c}{2}$.

  • the bottleneck (1×1 → 3×3 → 1×1) is modified to keep the channel width constant, which satisfies G1
  • in the modified bottleneck, the point-wise convolutions are no longer grouped, which satisfies G2
  • the channel split (one identity branch plus one modified-bottleneck branch) gives the effect of grouping (two groups) while avoiding too many groups, which satisfies G3
  • the two branches are merged by concatenation rather than element-wise add, which satisfies G4 and further helps G1

Finally, a channel shuffle, as in ShuffleNet V1, lets information flow between the two branches.

The element-wise add is gone, and element-wise operations like ReLU and depthwise convolutions exist only in one branch.
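Below is a minimal PyTorch sketch of the basic (stride-1) unit just described: channel split, an equal-width 1×1 → 3×3 depthwise → 1×1 branch without groups, concatenation instead of add, then channel shuffle. The layer/BN ordering follows my reading of the paper's Fig. 3(c) and may differ in detail from the official implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches can exchange information."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    """Basic (stride-1) ShuffleNet V2 unit: split -> branch -> concat -> shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2                                        # c' = c/2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),                      # 1x1, no groups (G2), equal width (G1)
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), # 3x3 depthwise
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False),                      # 1x1, no groups
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                      # channel split: identity + transform
        out = torch.cat((x1, self.branch(x2)), dim=1)   # concat, not add (G4)
        return channel_shuffle(out, 2)

# usage: y = ShuffleV2Unit(116)(torch.randn(1, 116, 28, 28))   # shape is preserved
```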

The overall architecture is built by stacking these units stage by stage, following the stage configuration table in the paper.

ShuffleNet V2 is not only fast but also accurate, for two reasons:

  • First, the high efficiency of each building block allows using more feature channels and a larger network capacity (at the same FLOPs, higher parameter efficiency means more capacity).
  • Second, half of the channels ($c' = \frac{c}{2}$) pass directly through the block and join the next block; this can be regarded as a kind of feature reuse, in the spirit of DenseNet and CondenseNet.

feature reuse

Panel (a) comes from DenseNet (see 【DenseNet】《Densely Connected Convolutional Networks》) and shows the l1-norm of the weights from each layer to the other layers. Read it row by row: the first row gives the weights from layer 1 to layers 1 through the last, the second row the weights from layer 2 to layers 2 through the last, and so on.

Panel (b), for ShuffleNet V2, shows how many channels of a layer $s$ directly reach a layer $l$; it is also best read row by row. With $c'$ set to $\frac{c}{2}$, the proportion should decay as 1, 1/2, 1/4, 1/8, ..., and the colour gradient along each row follows exactly this pattern.

Panel (a) shows that dense connections between all layers can introduce redundancy (the weights between adjacent layers are clearly stronger, i.e. redder; CondenseNet verified this as well), whereas in panel (b) the amount of feature reuse decays exponentially with the distance between layers.
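The geometric decay can be written down directly: with $c' = \frac{c}{2}$, each unit forwards half of its input unchanged, so the fraction of layer $s$'s channels that still reach layer $l$ through the identity path is $(1/2)^{l-s}$, the 1, 1/2, 1/4, 1/8, ... pattern described above. A tiny sketch (116 is the stage-2 width of the 1× model, used only for illustration):

```python
# With c' = c/2, each unit keeps half of its input channels untouched on the
# identity path, so the number of layer-s channels still present unchanged at
# layer l decays geometrically with the distance l - s.
c = 116                                   # stage width used purely for illustration
for distance in range(6):                 # distance = l - s
    reused = c * 0.5 ** distance
    print(f"l - s = {distance}: ~{reused:.1f} channels directly reused")
```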

5 Experiments

5.1 Datasets

  • ImageNet 2012 classification dataset
  • COCO for object detection

5.2 Speed and Accuracy

1) Accuracy vs. FLOPs

ShuffleNet V2's accuracy leads across the board, especially under smaller computational budgets; at small FLOPs, MobileNet v2 performs poorly because it has too few channels.

2) Inference Speed vs. FLOPs/Accuracy

From Table 8 of the paper we can see:

On GPU, at higher FLOPs, ShuffleNet V2 is far ahead of the rest.

On ARM, the models are roughly on par at higher FLOPs, but at low FLOPs MobileNet v2 falls behind (it violates G1 and G4, which is significant on mobile devices).

MobileNet v1's accuracy is mediocre, yet its speed is extremely high (even faster than ShuffleNet V2). Why? This was my biggest question when first reading Table 8. The authors explain:

We believe this is because its structure satisfies most of proposed guidelines (e.g. for G3, the fragments of MobileNet v1 are even fewer than ShuffleNet v2).

Why are IGCV2 and IGCV3 relatively slow?

usage of too many convolution groups

Both of these "surprises" are covered by the authors' practical guidelines.

Why are automatically searched models slow?

They violate G3 through the usage of too many fragments; combining the authors' guidelines with architecture search should yield even better models.

In terms of accuracy, ShuffleNet V2 is again in a class of its own.

3) Compatibility with other methods

Adding an SE module improves accuracy by 0.5 points (top-1 error 25.1 → 24.6), while FLOPs only grow from 591M to 597M.

4) Generalization to Large Models

Models with > 2 GFLOPs.

In short: fewer FLOPs, yet stronger results. The design also scales gracefully to large models.

5) Object Detection

The detection metric is the standard mmAP, i.e. the mAP averaged over box IoU thresholds from 0.5 to 0.95.

An interesting observation:

  • object detection performance: ShuffleNet V2 > Xception ≥ ShuffleNet V1 > MobileNet v2
  • classification performance: ShuffleNet V2 ≥ MobileNet v2 > ShuffleNet V1 > Xception

Apart from the first place, the ordering of the rest is reversed.

This is probably due to the larger receptive field of Xception building blocks than the other counterparts (7 vs. 3).

Looking at the Xception architecture, modules 5-12 each have a receptive field of 7, larger than that of the modified bottleneck.

For an introduction to Xception, see 【Xception】《Xception: Deep Learning with Depthwise Separable Convolutions》.

6 Conclusion

Four practical guidelines:

G1: Equal channel width minimizes memory access cost (MAC)

G2: Excessive group convolution increases MAC

G3: Network fragmentation reduces degree of parallelism

G4: Element-wise operations are non-negligible

  • ShuffleNet V1:
    • point-wise group convolution (violates G2)
    • bottleneck-like structure (violates G1)
    • too many groups (violates G3)
    • shortcut connection with element-wise add (violates G4)
  • MobileNet v2:
    • inverted bottleneck structure (violates G1)
    • depth-wise convolution (violates G4)
    • ReLU on "thick" feature maps (violates G4)
  • auto-generated structures (violate G3)

Judging from inference speed (Table 8), MobileNet v1 turns out to be very well designed with respect to G1-G4.

After reading the IGC series, revisit the relevant part here (usage of too many convolution groups).
After reading up on AutoML, revisit the relevant part here (how exactly should G3 be understood?).
For object detection, note the importance of the receptive field (Table 7).
Feature reuse (the comparison with DenseNet is worth borrowing).

7 Supplement

  • 余霆嵩's answer on Zhihu: https://www.zhihu.com/question/287433673/answer/460247126 (consider software and hardware together)
On the "software" side, mainly consider the number of weight parameters (smaller memory footprint) and the specific computation pattern (fewer FLOPs); on the hardware side, consider MAC (memory access cost), degree of parallelism, data access patterns, and so on. (From the Zhihu question 如何评价shufflenet V2?)
