【ShuffleNet V2】《ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design》

ECCV-2018

Caffe version of the code: https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe/blob/master/shufflenet_v2_x1.0.prototxt

Caffe network visualization tool: http://ethereon.github.io/netscope/#/editor

Table of Contents

  • 1 Background and Motivation
  • 2 Advantages / Contributions
  • 3 Innovations
  • 4 Method
    • 4.1 Practical Guidelines for Efficient Network Design
      • 4.1.1 G1: Equal channel width minimizes memory access cost (MAC)
      • 4.1.2 G2: Excessive group convolution increases MAC
      • 4.1.3 G3: Network fragmentation reduces degree of parallelism
      • 4.1.4 G4: Element-wise operations are non-negligible
    • 4.2 ShuffleNet V2
  • 5 Experiments
    • 5.1 Datasets
    • 5.2 Speed and Accuracy
  • 6 Conclusion
  • 7 Supplement

1 Background and Motivation

Current network architecture design is mostly guided by an indirect metric, FLOPs (floating-point operations, here meaning the number of multiply-add operations). This neglects factors that affect the direct metric (speed or latency), such as memory access cost and platform characteristics (e.g. GPU, ARM), so the resulting architectures may be sub-optimal.

Evidence: MobileNet v2 is much faster than NASNet-A even though their FLOPs are comparable. The comparison figure in the paper shows this clearly: at the same FLOPs (fixing the x-coordinate), different models run at different speeds.

Guided instead by the direct metric (speed or latency), the authors design a new architecture called ShuffleNet V2.

As the same comparison figure shows, ShuffleNet V2 achieves the best accuracy/speed trade-off among the compared models.

Differences between the indirect metric and the direct metric:

1) Several important factors that have a considerable effect on speed are not taken into account by FLOPs:

  • memory access cost (MAC): group convolution incurs a large MAC
  • degree of parallelism: at the same FLOPs, a model with high parallelism runs faster than one with low parallelism

2) Operations with the same FLOPs can have different running times, depending on the platform (GPU, ARM).

  • e.g. tensor decomposition reduces FLOPs by 75% yet runs slower, because the cuDNN version in use has dedicated optimizations for particular operations (such as 3×3 convolution)

2 Advantages / Contributions

  • four practical guidelines for efficient architecture design
  • ShuffleNet V2, whose accuracy at the same FLOPs is well ahead of other models, and whose inference speed is also outstanding (second only to MobileNet v1)

3 Innovations

  • from the indirect metric (FLOPs) to the direct metric (speed)
  • following the direct metric, four practical guidelines for architecture design
  • ShuffleNet V2 itself

4 Method

For details of ShuffleNet V1, see 【ShuffleNet】《ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices》.

4.1 Practical Guidelines for Efficient Network Design

The FLOPs metric only accounts for the convolution part; other costs are listed below:

  • data I/O
  • data shuffle
  • element-wise operations (add, ReLU)

4.1.1 G1: Equal channel width minimizes memory access cost (MAC)

In depthwise separable convolutions, most of the computation is concentrated in the 1×1 convolution.

  • FLOPs: $B = hwc_1c_2$ (input feature map $h \times w \times c_1$, output $h \times w \times c_2$)
  • MAC: $hw(c_1+c_2) + c_1c_2$ (input/output feature maps plus kernel weights)

By the mean value inequality $a+b \geq 2\sqrt{ab}$, it follows that

$$MAC \geq 2\sqrt{hwB} + \frac{B}{hw}$$

This form is not very intuitive; substituting $B$ and $MAC$ back makes it obvious:

$$hw(c_1+c_2) + c_1c_2 \geq 2hw\sqrt{c_1c_2} + c_1c_2$$

Equality holds when $a = b$, i.e. $c_1 = c_2$, so MAC is minimized when the number of input channels equals the number of output channels.

The authors verify this experimentally: on both GPU and ARM, with the input shape and FLOPs fixed, the network runs fastest when $c_1 = c_2$.
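A minimal numeric check of G1 (my own hypothetical sizes, not the paper's benchmark settings): fix the FLOPs of a 1×1 convolution and compare the MAC of a balanced channel configuration against an unbalanced one.

```python
# Minimal numeric check of G1 (hypothetical sizes, not the paper's benchmark).
# For a 1x1 conv: FLOPs B = h*w*c1*c2, MAC = h*w*(c1 + c2) + c1*c2.

def mac_1x1(h, w, c1, c2):
    """Memory access cost of a 1x1 conv: input + output feature maps + weights."""
    return h * w * (c1 + c2) + c1 * c2

h, w = 56, 56
# Two configurations with identical FLOPs, since 128*128 == 64*256:
balanced   = mac_1x1(h, w, 128, 128)   # c1 == c2
unbalanced = mac_1x1(h, w, 64, 256)    # c1 != c2, same B
print(balanced, unbalanced)            # 819200 < 1019904: balanced wins, as G1 predicts
```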

4.1.2 G2: Excessive group convolution increases MAC

The MAC and FLOPs $B$ of a $1 \times 1$ group convolution are

$$MAC = hw(c_1+c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw}$$

where

$$B = \frac{hwc_1c_2}{g}$$

With the input shape fixed and $B$ held constant, a larger $g$ leads to a larger MAC.

Note one detail: with more groups, the same FLOPs corresponds to more output channels, because group convolution cuts FLOPs by a factor of $g$.

The experiments show that the larger $g$ is, the slower the network runs, because MAC increases under the same input and FLOPs.
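A small sketch of the G2 relation (hypothetical sizes): hold the input shape and the FLOPs $B$ of a 1×1 group convolution fixed, let the output channels grow with $g$ accordingly, and watch MAC increase with the number of groups.

```python
# G2 sketch (hypothetical sizes): fix the input (h, w, c1) and the FLOPs B of a
# 1x1 group conv, then vary the group count g.  Since B = h*w*c1*c2/g, at fixed B
# the output channels scale as c2 = B*g/(h*w*c1), and
# MAC = h*w*(c1 + c2) + c1*c2/g = h*w*c1 + B*g/c1 + B/(h*w).

h, w, c1 = 28, 28, 256
B = h * w * c1 * 256                      # FLOPs of the g = 1 baseline (c2 = 256)

for g in (1, 2, 4, 8):
    c2 = B * g // (h * w * c1)            # output channels at this g, same B
    mac = h * w * (c1 + c2) + c1 * c2 // g
    print(f"g={g}: c2={c2}, MAC={mac}")   # MAC grows monotonically with g
```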

4.1.3 G3: Network fragmentation reduces degree of parallelism

Some networks such as Inception, as well as NASNet-A produced by AutoML, tend to adopt "multi-path" structures, i.e. one block contains many different convolution or pooling operations. (See the post ShuffleNetV2:轻量级CNN网络中的桂冠.)

Fragmentation refers to the small convolutions or pooling operations on each of several branches (e.g. one large convolution split into small convolutions carried out separately on each branch). Although such fragmented structures can improve accuracy, they reduce efficiency under high parallelism and introduce extra overhead (kernel launches, synchronization, etc.). (See the post ShufflenetV2_高效网络的4条实用准则.)

  • NASNet-A: 13 fragmented operators per building block
  • ResNet: 2 or 3
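A toy structural illustration of fragmentation (my own modules, not the paper's appendix benchmark blocks): one large 3×3 convolution versus the same FLOPs split across four parallel small convolutions; the fragmented variant launches several small kernels and synchronizes at the concat, which is what hurts GPU parallelism.

```python
import torch
import torch.nn as nn

class OneBigConv(nn.Module):
    """A single 3x3 conv over all channels: one large, GPU-friendly kernel."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

class FragmentedConv(nn.Module):
    """Same total FLOPs, but split into 4 parallel small convs (c -> c/4 each):
    more kernel launches plus a synchronizing concat -- the 'fragmentation' of G3."""
    def __init__(self, c, branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c // branches, 3, padding=1) for _ in range(branches)])

    def forward(self, x):
        return torch.cat([m(x) for m in self.branches], dim=1)

# Both map (1, 256, 28, 28) -> (1, 256, 28, 28) with identical FLOPs,
# yet the fragmented one is typically slower on GPU.
x = torch.randn(1, 256, 28, 28)
print(OneBigConv(256)(x).shape, FragmentedConv(256)(x).shape)
```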

4.1.4 G4: Element-wise operations are non-negligible

Element-wise operations:

  • ReLU
  • AddTensor
  • AddBias
  • depthwise convolution (the authors also count it here because of its high MAC/FLOPs ratio)

These operations have small FLOPs but relatively heavy MAC.

In the corresponding table, reference [5] is ResNet.
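A back-of-envelope view (hypothetical tensor size) of why element-wise ops have "small FLOPs but relatively heavy MAC": an element-wise add needs roughly three memory accesses per FLOP, while a 3×3 convolution reuses each loaded value many times.

```python
# Rough MAC/FLOPs comparison (hypothetical sizes): element-wise add vs. 3x3 conv.
h, w, c = 56, 56, 128
n = h * w * c                           # elements per feature map

# Element-wise add of two feature maps: read two tensors, write one.
add_flops = n
add_mac   = 3 * n                       # ~3 memory accesses per FLOP

# 3x3 convolution, c -> c channels (ignoring padding effects).
conv_flops = h * w * c * c * 9
conv_mac   = 2 * n + c * c * 9          # in/out feature maps + kernel weights

print(add_mac / add_flops)              # 3.0
print(conv_mac / conv_flops)            # ~0.002: far more compute per byte moved
```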

4.2 ShuffleNet V2

In ShuffleNet V1:

  • bottleneck-like structure (violates G1)
  • point-wise group convolution (violates G2)
  • too many groups (violates G3)
  • shortcut connection with element-wise add (violates G4)

ShuffleNet V2 introduces a channel split: the $c$ input channels are split into $c - c'$ and $c'$ channels, forming two branches (or, put differently, two groups); the authors set $c' = \frac{c}{2}$.

  • the bottleneck (1×1 → 3×3 → 1×1) is modified to keep the channel width constant, which satisfies G1
  • in the modified bottleneck, the point-wise convolutions are no longer grouped, which satisfies G2
  • the channel split (one identity branch plus one modified-bottleneck branch) gives the effect of grouping (two groups) while avoiding too many groups, which satisfies G3
  • the two branches are merged by concatenation rather than element-wise add, which satisfies G4 and further helps G1

Finally, a channel shuffle, as in ShuffleNet V1, lets information flow between the two branches.

The element-wise add is gone, and element-wise operations like ReLU and depthwise convolutions exist only in one branch.
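Below is a minimal PyTorch sketch of the basic (stride-1) unit just described: channel split, an equal-width 1×1 → 3×3 depthwise → 1×1 branch without groups, concatenation instead of add, then channel shuffle. The layer/BN ordering follows my reading of the paper's Fig. 3(c) and may differ in detail from the official implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches can exchange information."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    """Basic (stride-1) ShuffleNet V2 unit: split -> branch -> concat -> shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2                                        # c' = c/2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),                      # 1x1, no groups (G2), equal width (G1)
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), # 3x3 depthwise
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False),                      # 1x1, no groups
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                      # channel split: identity + transform
        out = torch.cat((x1, self.branch(x2)), dim=1)   # concat, not add (G4)
        return channel_shuffle(out, 2)

# usage: y = ShuffleV2Unit(116)(torch.randn(1, 116, 28, 28))   # shape is preserved
```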

The overall architecture is built by stacking these units stage by stage, following the stage configuration table in the paper.

ShuffleNet V2 is not only fast but also accurate, for two reasons:

  • First, the high efficiency of each building block allows using more feature channels and a larger network capacity (at the same FLOPs, higher parameter efficiency means more capacity).
  • Second, half of the channels ($c' = \frac{c}{2}$) pass directly through the block and join the next block; this can be regarded as a kind of feature reuse, in the spirit of DenseNet and CondenseNet.

feature reuse

Panel (a) comes from DenseNet (see 【DenseNet】《Densely Connected Convolutional Networks》) and shows the l1-norm of the weights from each layer to the other layers. Read it row by row: the first row gives the weights from layer 1 to layers 1 through the last, the second row the weights from layer 2 to layers 2 through the last, and so on.

Panel (b), for ShuffleNet V2, shows how many channels of a layer $s$ directly reach a layer $l$; it is also best read row by row. With $c'$ set to $\frac{c}{2}$, the proportion should decay as 1, 1/2, 1/4, 1/8, ..., and the colour gradient along each row follows exactly this pattern.

Panel (a) shows that dense connections between all layers can introduce redundancy (the weights between adjacent layers are clearly stronger, i.e. redder; CondenseNet verified this as well), whereas in panel (b) the amount of feature reuse decays exponentially with the distance between layers.
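The geometric decay can be written down directly: with $c' = \frac{c}{2}$, each unit forwards half of its input unchanged, so the fraction of layer $s$'s channels that still reach layer $l$ through the identity path is $(1/2)^{l-s}$, the 1, 1/2, 1/4, 1/8, ... pattern described above. A tiny sketch (116 is the stage-2 width of the 1× model, used only for illustration):

```python
# With c' = c/2, each unit keeps half of its input channels untouched on the
# identity path, so the number of layer-s channels still present unchanged at
# layer l decays geometrically with the distance l - s.
c = 116                                   # stage width used purely for illustration
for distance in range(6):                 # distance = l - s
    reused = c * 0.5 ** distance
    print(f"l - s = {distance}: ~{reused:.1f} channels directly reused")
```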

5 Experiments

5.1 Datasets

  • ImageNet 2012 classification dataset
  • COCO for object detection

5.2 Speed and Accuracy

1) Accuracy vs. FLOPs

ShuffleNet V2's accuracy leads across the board, especially under smaller computational budgets; at small FLOPs, MobileNet v2 performs poorly because it has too few channels.

2) Inference Speed vs. FLOPs/Accuracy

From Table 8 of the paper we can see:

On GPU, at higher FLOPs, ShuffleNet V2 is far ahead of the rest.

On ARM, the models are roughly on par at higher FLOPs, but at low FLOPs MobileNet v2 falls behind (it violates G1 and G4, which is significant on mobile devices).

MobileNet v1's accuracy is mediocre, yet its speed is extremely high (even faster than ShuffleNet V2). Why? This was my biggest question when first reading Table 8. The authors explain:

We believe this is because its structure satisfies most of proposed guidelines (e.g. for G3, the fragments of MobileNet v1 are even fewer than ShuffleNet v2).

Why are IGCV2 and IGCV3 relatively slow?

usage of too many convolution groups

Both of these "surprises" are covered by the authors' practical guidelines.

Why are automatically searched models slow?

They violate G3 through the usage of too many fragments; combining the authors' guidelines with architecture search should yield even better models.

In terms of accuracy, ShuffleNet V2 is again in a class of its own.

3) Compatibility with other methods

Adding an SE module improves accuracy by 0.5 points (top-1 error 25.1 → 24.6), while FLOPs only grow from 591M to 597M.

4) Generalization to Large Models

Models with > 2 GFLOPs.

In short: fewer FLOPs, yet stronger results. The design also scales gracefully to large models.

5) Object Detection

The detection metric is the standard mmAP, i.e. the mAP averaged over box IoU thresholds from 0.5 to 0.95.

An interesting observation:

  • object detection performance: ShuffleNet V2 > Xception ≥ ShuffleNet V1 > MobileNet v2
  • classification performance: ShuffleNet V2 ≥ MobileNet v2 > ShuffleNet V1 > Xception

Apart from the first place, the ordering of the rest is reversed.

This is probably due to the larger receptive field of Xception building blocks than the other counterparts (7 vs. 3).

Looking at the Xception architecture, modules 5-12 each have a receptive field of 7, larger than that of the modified bottleneck.

For an introduction to Xception, see 【Xception】《Xception: Deep Learning with Depthwise Separable Convolutions》.

6 Conclusion

Four practical guidelines:

G1: Equal channel width minimizes memory access cost (MAC)

G2: Excessive group convolution increases MAC

G3: Network fragmentation reduces degree of parallelism

G4: Element-wise operations are non-negligible

  • ShuffleNet V1:
    • point-wise group convolution (violates G2)
    • bottleneck-like structure (violates G1)
    • too many groups (violates G3)
    • shortcut connection with element-wise add (violates G4)
  • MobileNet v2:
    • inverted bottleneck structure (violates G1)
    • depth-wise convolution (violates G4)
    • ReLU on "thick" feature maps (violates G4)
  • auto-generated structures (violate G3)

Judging from inference speed (Table 8), MobileNet v1 turns out to be very well designed with respect to G1-G4.

After reading the IGC series, revisit the relevant part here (usage of too many convolution groups).
After reading up on AutoML, revisit the relevant part here (how exactly should G3 be understood?).
For object detection, note the importance of the receptive field (Table 7).
Feature reuse (the comparison with DenseNet is worth borrowing).

7 Supplement

  • 余霆嵩's answer on Zhihu: https://www.zhihu.com/question/287433673/answer/460247126 (consider software and hardware together)
On the "software" side, mainly consider the number of weight parameters (smaller memory footprint) and the specific computation pattern (fewer FLOPs); on the hardware side, consider MAC (memory access cost), degree of parallelism, data access patterns, and so on. (From the Zhihu question 如何评价shufflenet V2?)
