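ShuffleNet
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Paper: ShuffleNet (https://arxiv.org/pdf/1707.01083.pdf)
Abstract
The paper introduces ShuffleNet, an extremely computation-efficient CNN architecture designed for mobile devices with limited computing power. It relies on two operations: pointwise group convolution and channel shuffle. Together they greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification show that ShuffleNet outperforms other state-of-the-art architectures under a computation budget of 40 MFLOPs. It also runs about 13 times faster than AlexNet while maintaining comparable accuracy.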
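Introduction
The prevailing trend is to build deeper and larger convolutional networks to solve recognition tasks. The most accurate of these networks have a large number of layers and channels and require billions of FLOPs. This paper takes the opposite view: pursuing the best accuracy on mobile devices with limited computing power. Much existing work in this area focuses on pruning, compressing, or low-bit representation of a "basic" network architecture. This paper instead aims to design an efficient basic architecture for the desired computation range.
The paper uses pointwise group convolutions to reduce the computational complexity of 1x1 convolutions. To overcome the side effect of group convolution, namely that information cannot flow between channel groups, it proposes the channel shuffle operation. Based on these two operations, it builds an efficient architecture named ShuffleNet. For a given computation budget, ShuffleNet allows more feature map channels, which encode more information; this is especially important for the performance of small networks.
ShuffleNet outperforms many state-of-the-art architectures. For example, its top-1 error on ImageNet classification is 7.8% (absolute) lower than MobileNet's. While maintaining accuracy comparable to AlexNet, it achieves about a 13x actual speedup (18x in theory).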
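Related Work
Efficient model design
In recent years, deep neural networks have achieved great success in computer vision, and model design has played an important role. The growing demand for high-quality deep neural networks on embedded systems has encouraged research into efficient model design. For example, GoogLeNet increases network depth with far fewer parameters than architectures that simply stack convolutional layers, and ResNet achieves strong performance with its bottleneck structure. A more recent line of research uses reinforcement learning and model search to design efficient models.
Group convolution
The concept of group convolution first appeared in AlexNet, where it was used to distribute the model across two GPUs, and it has also proven effective in ResNeXt. Depthwise separable convolution generalizes the idea of separable convolutions. MobileNet uses depthwise separable convolutions and achieves state-of-the-art results. The paper generalizes group convolution and depthwise separable convolution into a novel form.
Channel shuffle
Although CNN libraries contain a "random sparse convolution" layer, the channel shuffle operation has rarely been mentioned in previous work. Concurrently with this paper, other work adopted the idea for a two-stage convolution, but it did not specifically investigate the effectiveness of channel shuffle itself or its use in small models.
Model acceleration
This direction aims to accelerate inference while preserving the accuracy of a pre-trained model. Pruning removes redundant network connections or channels from a pre-trained model while maintaining its performance. Quantization and factorization reduce redundancy in computation to speed up inference. Without modifying the parameters, convolution algorithms optimized with FFT or other methods reduce time consumption in practice. Distilling transfers knowledge from a large network to a small one, which makes the small network easier to train.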
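Method
Channel shuffle for group convolutions
Because group convolution blocks information flow between channel groups, channel shuffle is used to let channels from different groups exchange information, which considerably improves the network's representational power.
The shuffle operation is illustrated in the figure below: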
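In code terms, the shuffle amounts to a reshape, a transpose, and a flatten over the channel axis. The following is a minimal pure-Python sketch on a flat list of channel indices; the function name `channel_shuffle_1d` is hypothetical and chosen only for illustration, and the real implementation operates on 4D tensors.

```python
def channel_shuffle_1d(channels, groups):
    """Shuffle a flat list of channels so each output group mixes all input groups."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # "Reshape" to (groups, per_group): one row per input group.
    rows = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # "Transpose" to (per_group, groups) and flatten back to one list.
    return [rows[g][i] for i in range(per_group) for g in range(groups)]


shuffled = channel_shuffle_1d(list(range(9)), groups=3)
print(shuffled)  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

With 9 channels in 3 groups, every consecutive block of 3 output channels now contains one channel from each original group, so the next group convolution sees information from all groups.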
Pointwise group convolution
The pointwise group convolution proposed in the paper simply combines 1x1 convolution (pointwise convolution) with group convolution, i.e. pointwise group convolution = pointwise convolution + group convolution.
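To make the combination concrete, here is a small pure-Python sketch that applies a grouped 1x1 convolution to a single pixel and counts its weights. Both helper names (`pointwise_conv_params`, `pointwise_group_conv`) are hypothetical, and biases are ignored as an assumption; this is an illustration rather than the implementation used in the code listing of this post.

```python
def pointwise_conv_params(c_in, c_out, groups=1):
    """Weight count of a bias-free 1x1 convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return groups * (c_in // groups) * (c_out // groups)


def pointwise_group_conv(pixel, weights):
    """Apply a grouped 1x1 convolution to the channel vector of one pixel.

    pixel:   list of c_in values
    weights: one matrix per group, each shaped [c_out / g][c_in / g]
    """
    groups = len(weights)
    per_group_in = len(pixel) // groups
    out = []
    for g, matrix in enumerate(weights):
        # Each group only sees its own slice of the input channels.
        chunk = pixel[g * per_group_in:(g + 1) * per_group_in]
        for row in matrix:
            out.append(sum(w * v for w, v in zip(row, chunk)))
    return out


print(pointwise_conv_params(240, 240))            # 57600
print(pointwise_conv_params(240, 240, groups=3))  # 19200
print(pointwise_group_conv([1, 2, 3, 4],
                           [[[1, 0], [0, 1]], [[1, 1], [1, -1]]]))  # [1, 2, 7, -1]
```

Splitting into g groups divides the weight count (and the multiply-adds per pixel) by g, at the price that an output channel depends only on the inputs of its own group.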
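Based on these two techniques, the paper proposes ShuffleNet units, which come in the three forms shown below:
Here GConv denotes the pointwise group convolution operation; the details can be seen in the code later in this post. DWConv is the depthwise convolution taken from the depthwise separable convolution used in MobileNet, which factors a convolution into a per-channel spatial convolution followed by a pointwise convolution. For the underlying principle, see the blog post "MobileNet algorithm".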
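The saving from this factorization can be checked with a short calculation. The sketch below counts multiply-accumulate operations, assuming stride 1 and 'same' padding so that the output keeps the h x w input size; the function names are hypothetical.

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates of a standard k x k convolution on an h x w map."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates of a depthwise k x k conv followed by a 1x1 conv."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes the channels
    return depthwise + pointwise


standard = standard_conv_macs(28, 28, 3, 240, 240)
separable = depthwise_separable_macs(28, 28, 3, 240, 240)
print(standard)   # 406425600
print(separable)  # 46851840
# The ratio equals 1 / c_out + 1 / k**2.
print(round(separable / standard, 4))  # 0.1153
```

After the factorization the pointwise part dominates the cost, which is why ShuffleNet applies group convolution to the 1x1 layers.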
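Overall ShuffleNet architecture
The overall structure is essentially the same as ResNet: it is divided into several stages (ResNet has 4 stages; here there are only 3), and within each stage the original residual block is replaced by a ShuffleNet unit, which is the core of the ShuffleNet algorithm. The table keeps the complexity fixed and varies the number of groups (g) to change the number of output channels; more output channels can generally extract more features.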
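The relationship between g and the channel widths can be sketched as follows. The stage-2 widths are taken from the `out_dim_stage_two` dictionary in the code listing of this post, and the doubling per stage and the scaling mirror that listing; the constant and function names here are hypothetical.

```python
# Stage-2 output channels per group count, as in `out_dim_stage_two` in the listing.
STAGE2_CHANNELS = {1: 144, 2: 200, 3: 240, 4: 272, 8: 384}


def stage_output_channels(groups, num_stages=3, scale_factor=1.0):
    """Return output channels for [conv1, stage 2, stage 3, ...]."""
    base = STAGE2_CHANNELS[groups]
    # conv1 has 24 channels; every later stage doubles the stage-2 width.
    channels = [24] + [base * 2 ** i for i in range(num_stages)]
    return [int(c * scale_factor) for c in channels]


print(stage_output_channels(1))  # [24, 144, 288, 576]
print(stage_output_channels(3))  # [24, 240, 480, 960]
print(stage_output_channels(8))  # [24, 384, 768, 1536]
```

A larger g makes each pointwise convolution cheaper, so the same budget can be spent on wider feature maps.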
Code implementation (Keras)
from keras import backend as K
from keras.applications.imagenet_utils import _obtain_input_shape
from keras.models import Model
from keras.engine.topology import get_source_inputs
from keras.layers import Activation, Add, Concatenate, GlobalAveragePooling2D,GlobalMaxPooling2D, Input, Dense
from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D, BatchNormalization, Lambda
from keras.applications.mobilenet import DepthwiseConv2D
import numpy as np
def ShuffleNet(include_top=True, input_tensor=None, scale_factor=1.0, pooling='max',
               input_shape=(224, 224, 3), groups=1, load_model=None, num_shuffle_units=[3, 7, 3],
               bottleneck_ratio=0.25, classes=1000):
    """
    ShuffleNet implementation for Keras 2
    ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
    Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun
    https://arxiv.org/pdf/1707.01083.pdf
    Note that only TensorFlow is supported for now, therefore it only works
    with the data format `image_data_format='channels_last'` in your Keras
    config at `~/.keras/keras.json`.

    Parameters
    ----------
    include_top: bool(True)
        whether to include the fully-connected layer at the top of the network.
    input_tensor:
        optional Keras tensor (i.e. output of `layers.Input()`) to use as image input for the model.
    scale_factor:
        scales the number of output channels
    input_shape:
    pooling:
        Optional pooling mode for feature extraction
        when `include_top` is `False`.
        - `None` means that the output of the model
            will be the 4D tensor output of the
            last convolutional layer.
        - `avg` means that global average pooling
            will be applied to the output of the
            last convolutional layer, and thus
            the output of the model will be a
            2D tensor.
        - `max` means that global max pooling will
            be applied.
    groups: int
        number of groups used in the pointwise group convolutions
    num_shuffle_units: list([3,7,3])
        number of stages (list length) and the number of shufflenet units in a
        stage beginning with stage 2 because stage 1 is fixed
        e.g. idx 0 contains 3 + 1 (first shuffle unit in each stage differs) shufflenet units for stage 2
        idx 1 contains 7 + 1 Shufflenet Units for stage 3 and
        idx 2 contains 3 + 1 Shufflenet Units
    bottleneck_ratio:
        bottleneck ratio implies the ratio of bottleneck channels to output channels.
        For example, bottleneck ratio = 1 : 4 means the output feature map is 4 times
        the width of the bottleneck feature map.
    classes: int(1000)
        number of classes to predict

    Returns
    -------
        A Keras model instance

    References
    ----------
    - [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices]
      (http://www.arxiv.org/pdf/1707.01083.pdf)
    """
    if K.backend() != 'tensorflow':
        raise RuntimeError('Only TensorFlow backend is currently supported, '
                           'as other backends do not support ')

    name = "ShuffleNet_%.2gX_g%d_br_%.2g_%s" % (
        scale_factor, groups, bottleneck_ratio, "".join([str(x) for x in num_shuffle_units]))

    input_shape = _obtain_input_shape(input_shape,
                                      default_size=224,
                                      min_size=28,
                                      require_flatten=include_top,
                                      data_format=K.image_data_format())

    # Stage-2 output channels for each supported number of groups.
    out_dim_stage_two = {1: 144, 2: 200, 3: 240, 4: 272, 8: 384}
    if groups not in out_dim_stage_two:
        raise ValueError("Invalid number of groups.")
    if pooling not in ['max', 'avg']:
        raise ValueError("Invalid value for pooling.")
    if not (float(scale_factor) * 4).is_integer():
        raise ValueError("Invalid value for scale_factor. Should be x over 4.")

    # Compute the number of output channels for each stage.
    exp = np.insert(np.arange(0, len(num_shuffle_units), dtype=np.float32), 0, 0)
    out_channels_in_stage = 2 ** exp
    out_channels_in_stage *= out_dim_stage_two[groups]  # calculate output channels for each stage
    out_channels_in_stage[0] = 24  # first stage has always 24 output channels
    out_channels_in_stage *= scale_factor
    out_channels_in_stage = out_channels_in_stage.astype(int)

    # Build the model input.
    if input_tensor is None:
        img_input = Input(shape=input_shape)
    else:
        if not K.is_keras_tensor(input_tensor):
            img_input = Input(tensor=input_tensor, shape=input_shape)
        else:
            img_input = input_tensor

    # Create the ShuffleNet architecture (stage 1: conv + max pooling).
    x = Conv2D(filters=out_channels_in_stage[0], kernel_size=(3, 3), padding='same',
               use_bias=False, strides=(2, 2), activation="relu", name="conv1")(img_input)
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same', name="maxpool1")(x)

    # Create stages 2-4 (3 blocks in total), each containing shufflenet units.
    for stage in range(0, len(num_shuffle_units)):
        repeat = num_shuffle_units[stage]
        x = _block(x, out_channels_in_stage, repeat=repeat,
                   bottleneck_ratio=bottleneck_ratio,
                   groups=groups, stage=stage + 2)

    # Top of the model: global pooling and optional classifier.
    if pooling == 'avg':
        x = GlobalAveragePooling2D(name="global_pool")(x)
    elif pooling == 'max':
        x = GlobalMaxPooling2D(name="global_pool")(x)

    if include_top:
        x = Dense(units=classes, name="fc")(x)
        x = Activation('softmax', name='softmax')(x)

    if input_tensor is not None:
        inputs = get_source_inputs(input_tensor)
    else:
        inputs = img_input

    model = Model(inputs=inputs, outputs=x, name=name)

    if load_model is not None:
        # Assumes `load_model` is the path to a weights file; the original
        # passed an empty string here, which cannot load anything.
        model.load_weights(load_model, by_name=True)

    return model
# Build one block (stage) of ShuffleNet.
def _block(x, channel_map, bottleneck_ratio, repeat=1, groups=1, stage=1):
    """
    creates a bottleneck block containing `repeat + 1` shuffle units

    Parameters
    ----------
    x:
        Input tensor with `channels_last` data format
    channel_map: list
        list containing the number of output channels for a stage
    repeat: int(1)
        number of repetitions for a shuffle unit with stride 1
    groups: int(1)
        number of groups used in the pointwise group convolutions
    bottleneck_ratio: float
        bottleneck ratio implies the ratio of bottleneck channels to output channels.
        For example, bottleneck ratio = 1 : 4 means the output feature map is 4 times
        the width of the bottleneck feature map.
    stage: int(1)
        stage number

    Returns
    -------
        output tensor of the block
    """
    # Only the first unit of a stage (the transition between two stages) uses
    # stride 2 and Concatenate; all other units use stride 1 and Add.
    x = _shuffle_unit(x, in_channels=channel_map[stage - 2],
                      out_channels=channel_map[stage - 1], strides=2,
                      groups=groups, bottleneck_ratio=bottleneck_ratio,
                      stage=stage, block=1)

    for i in range(1, repeat + 1):
        x = _shuffle_unit(x, in_channels=channel_map[stage - 1],
                          out_channels=channel_map[stage - 1], strides=1,
                          groups=groups, bottleneck_ratio=bottleneck_ratio,
                          stage=stage, block=(i + 1))

    return x
def _shuffle_unit(inputs, in_channels, out_channels, groups, bottleneck_ratio, strides=2, stage=1, block=1):
    """
    creates a shuffle unit

    Parameters
    ----------
    inputs:
        Input tensor with `channels_last` data format
    in_channels:
        number of input channels
    out_channels:
        number of output channels
    strides:
        An integer or tuple/list of 2 integers,
        specifying the strides of the convolution along the width and height.
    groups: int(1)
        number of groups used in the pointwise group convolutions
    bottleneck_ratio: float
        bottleneck ratio implies the ratio of bottleneck channels to output channels.
        For example, bottleneck ratio = 1 : 4 means the output feature map is 4 times
        the width of the bottleneck feature map.
    stage: int(1)
        stage number
    block: int(1)
        block number

    Returns
    -------
        output tensor of the shuffle unit
    """
    if K.image_data_format() == 'channels_last':
        bn_axis = -1
    else:
        bn_axis = 1

    prefix = 'stage%d/block%d' % (stage, block)

    # if strides >= 2:
    #     out_channels -= in_channels

    # default: 1/4 of the output channel of a ShuffleNet Unit
    bottleneck_channels = int(out_channels * bottleneck_ratio)
    # No group convolution at the transition from stage 1 to stage 2.
    # Note that this also sets groups to 1 for the shuffle and the second
    # pointwise convolution of that unit.
    groups = (1 if stage == 2 and block == 1 else groups)

    x = _group_conv(inputs, in_channels, out_channels=bottleneck_channels,
                    groups=(1 if stage == 2 and block == 1 else groups),
                    name='%s/1x1_gconv_1' % prefix)
    x = BatchNormalization(axis=bn_axis, name='%s/bn_gconv_1' % prefix)(x)
    x = Activation('relu', name='%s/relu_gconv_1' % prefix)(x)

    # Implement the channel shuffle with a Lambda layer.
    x = Lambda(channel_shuffle, arguments={'groups': groups}, name='%s/channel_shuffle' % prefix)(x)
    x = DepthwiseConv2D(kernel_size=(3, 3), padding="same", use_bias=False,
                        strides=strides, name='%s/1x1_dwconv_1' % prefix)(x)
    x = BatchNormalization(axis=bn_axis, name='%s/bn_dwconv_1' % prefix)(x)

    x = _group_conv(x, bottleneck_channels,
                    out_channels=out_channels if strides == 1 else out_channels - in_channels,
                    groups=groups, name='%s/1x1_gconv_2' % prefix)
    x = BatchNormalization(axis=bn_axis, name='%s/bn_gconv_2' % prefix)(x)

    # The stride decides whether the shortcut is merged by Add or Concatenate.
    if strides < 2:
        ret = Add(name='%s/add' % prefix)([x, inputs])
    else:
        avg = AveragePooling2D(pool_size=3, strides=2, padding='same', name='%s/avg_pool' % prefix)(inputs)
        ret = Concatenate(bn_axis, name='%s/concat' % prefix)([x, avg])

    ret = Activation('relu', name='%s/relu_out' % prefix)(ret)

    return ret
# Implement group convolution by slicing the channels, convolving each slice,
# and concatenating the results.
def _group_conv(x, in_channels, out_channels, groups, kernel=1, stride=1, name=''):
    """
    grouped convolution

    Parameters
    ----------
    x:
        Input tensor with `channels_last` data format
    in_channels:
        number of input channels
    out_channels:
        number of output channels
    groups:
        number of groups the channels are split into
    kernel: int(1)
        An integer or tuple/list of 2 integers, specifying the
        width and height of the 2D convolution window.
        Can be a single integer to specify the same value for
        all spatial dimensions.
    stride: int(1)
        An integer or tuple/list of 2 integers,
        specifying the strides of the convolution along the width and height.
        Can be a single integer to specify the same value for all spatial dimensions.
    name: str
        A string that specifies the layer name

    Returns
    -------
        output tensor of the grouped convolution
    """
    if groups == 1:
        return Conv2D(filters=out_channels, kernel_size=kernel, padding='same',
                      use_bias=False, strides=stride, name=name)(x)

    # number of input channels per group
    ig = in_channels // groups
    group_list = []

    assert out_channels % groups == 0

    for i in range(groups):
        offset = i * ig
        group = Lambda(lambda z: z[:, :, :, offset: offset + ig], name='%s/g%d_slice' % (name, i))(x)
        group_list.append(Conv2D(int(0.5 + out_channels / groups), kernel_size=kernel, strides=stride,
                                 use_bias=False, padding='same', name='%s_/g%d' % (name, i))(group))
    return Concatenate(name='%s/concat' % name)(group_list)
# Implement the shuffle with the reshape/transpose trick described in the paper.
def channel_shuffle(x, groups):
    """
    Parameters
    ----------
    x:
        Input tensor with `channels_last` data format
    groups: int
        number of groups the channels are split into

    Returns
    -------
        channel shuffled output tensor

    Examples
    --------
    Example for a 1D Array with 3 groups
    >>> d = np.array([0,1,2,3,4,5,6,7,8])
    >>> x = np.reshape(d, (3,3))
    >>> x = np.transpose(x, [1,0])
    >>> x = np.reshape(x, (9,))
    '[0 1 2 3 4 5 6 7 8] --> [0 3 6 1 4 7 2 5 8]'
    """
    height, width, in_channels = x.shape.as_list()[1:]
    channels_per_group = in_channels // groups

    x = K.reshape(x, [-1, height, width, groups, channels_per_group])
    x = K.permute_dimensions(x, (0, 1, 2, 4, 3))  # transpose
    x = K.reshape(x, [-1, height, width, in_channels])

    return x
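As a sanity check on the channel bookkeeping in `_shuffle_unit`, the sketch below traces output shapes without building a model. It assumes 'same' padding (so a stride-2 layer yields ceil(size / 2)) and the Add/Concatenate rule from the listing; the helper name `shuffle_unit_output_shape` is hypothetical.

```python
def shuffle_unit_output_shape(height, width, in_channels, out_channels, strides):
    """Trace the (height, width, channels) shape produced by one shuffle unit."""
    if strides == 1:
        # The residual Add needs identical shapes on both branches.
        if in_channels != out_channels:
            raise ValueError("stride-1 units require in_channels == out_channels")
        return (height, width, out_channels)
    # Stride 2: the conv branch emits out_channels - in_channels channels and
    # the average-pooled shortcut keeps in_channels; Concatenate joins them.
    branch_channels = out_channels - in_channels
    shortcut_channels = in_channels
    out_h = -(-height // 2)  # ceil division, as with 'same' padding
    out_w = -(-width // 2)
    return (out_h, out_w, branch_channels + shortcut_channels)


# First unit of stage 2 for groups=3, after conv1 and max pooling (224 -> 56).
shape = shuffle_unit_output_shape(56, 56, 24, 240, strides=2)
print(shape)  # (28, 28, 240)
print(shuffle_unit_output_shape(*shape[:2], 240, 240, strides=1))  # (28, 28, 240)
```

This is why the second pointwise convolution of a stride-2 unit is built with `out_channels - in_channels` filters: the concatenated shortcut supplies the remaining channels.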