
Notes on the ICLR 2017 paper: FASTER CNNS WITH DIRECT SPARSE CONVOLUTIONS AND GUIDED PRUNING

Authors: Jongsoo Park, Sheng Li, et al., from Intel Labs

Abstract

  • The number of parameters needed in CNNs, however, are often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. In short: CNNs are increasingly important, but they carry many parameters, and pruning after training has so far delivered only limited benefit.
  • Previous methods prune mainly the FC layers, while the conv layers, which dominate inference time, are left unpruned.
  • We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. The authors therefore propose a pruning approach that also speeds up the conv layers.
  • our success is an efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. That is, an efficient multiplication of a sparse matrix (the pruned kernels) with a dense matrix (the feature maps) that works for any sparsity pattern.
  • Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. They also propose a performance model that predicts the best sparsity range per layer and per target architecture.
  • Together, these two allow us to demonstrate 3.1–7.3× convolution speedups over dense convolution in AlexNet. Pruning the conv layers of AlexNet yields 3.1-7.3x speedups.
  • Code: https://github.com/IntelLabs/SkimCaffe

Introduction

  • Post-processing may consist of retraining with sparsity inducing regularization or of approximating tensors of parameters via tensor factorization. In the papers of Han et al. (2015; 2016b), post-processing means sparsification or tensor factorization, but they do not address how pruning actually speeds up the conv layers.
  • The crux of speed improvement thus lie in actual fast convolution of sparse kernels with feature maps. The key to conv-layer compression is an efficient convolution between dense feature maps and sparse kernels.
  • the performance of sparse matrix operations is typically memory bandwidth bound.
  • (Lebedev & Lempitsky, 2015; Wen et al., 2016) propose group-wise sparsity patterns.
  • Convolutions in CNNs involve multiple channels and thus offer much higher data reuse than typical sparse matrix operations in scientific computing. Unlike earlier work, the authors approach conv compression from this angle: convolution reuses data heavily, so sparse convolution need not be bandwidth bound.
  • Specifically, we present a highly efficient direct sparse convolution design formulated as sparse-matrix-dense-matrix multiplication with the dense matrix columns generated on-the-fly from a single column vector. The dense matrix is never materialized; its columns are generated on the fly from a single column vector (the vectorized input), giving an efficient sparse-times-dense multiplication.
  • we formulate a performance model to elucidate when and how best to use sparse convolutions on different computer architectures and at different CNN layers. They also build a model that predicts the most useful sparsity range for each layer.
  • The pruning algorithm of this paper is called Guided Sparsity Learning (GSL).
  • In some cases, particular layers are identified as best not pruned at all due to no potential speedups; leaving them unchanged gives other layers more room for gainful pruning in terms of size and speed.

Contributions

  • A high performance sparse convolution design that takes advantage of arbitrary sparsity patterns and outperforms dense convolution even with a moderate sparsity.
  • A general performance model that (1) projects speedups over dense convolutions on varying level/types of sparsity and different computing platforms and (2) provides training guidelines for precisely targeting layers and sparsity ranges with potential to accelerate inference.

    -Guided Sparsity Learning (GSL), the first pruning algorithm fusing the awareness of speedup potential into sparsity learning; and its application to AlexNet and GoogLeNet. In particular, in GoogLeNet, we prune out more than 80% of parameters of all 5×5/3×3 conv layers and fc layers with no accuracy drop.

    -An optimized sparse convolution implementation (http://github.com/IntelLabs/SkimCaffe) that provides 7.3×, 3.4×, 3.1× speedups of convolution layers in AlexNet over dense methods on Intel Atom, Xeon, and Knights Landing processors, respectively, with no accuracy drop. In particular, this paper is one of the first evaluations of Xeon Phi processors on deep learning algorithms.

    All of this was already mentioned above.

Algorithm

DIRECT SPARSE CONVOLUTION

-A sparse convolution for all output positions across all output channels can eventually be considered as a virtual sparse-matrix-dense-matrix multiplication.

-Assume N filters, each of size R × S, over C input channels. The parameters of this layer then form an N × C × R × S tensor W; the input I is a C × H(in) × W(in) tensor and the output O is an N × H(out) × W(out) tensor. Each output value is the dot product of two 3D tensors:

O(n, y, x) = Σ_c Σ_r Σ_s W(n, c, r, s) · I(c, y + r, x + s)

-This can be viewed as: first vectorizing the 3D subtensor of W corresponding to the nth output channel, then vectorizing I (denoted vec(I)), and finally stretching the first vector to match the dimension of the two vectors. When W is sparse, this becomes a sparse-vector-dense-vector dot product.

O(n, y, x) = stretch(vec(W(n, :, :, :))) · vec(I_(y,x))

- Consider flattening the dimensions of W other than the first into a sparse matrix W(1) (i.e., the mode-1 matricization of W as in Kolda & Bader (2009)), with its row vectors stretched to match the dimension of vec(I).
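
As a quick illustration of the mode-1 matricization, a tiny NumPy/SciPy sketch with made-up shapes (not the SkimCaffe code):

```python
import numpy as np
import scipy.sparse as sp

# Toy shapes for illustration only.
N, C, R, S = 4, 3, 3, 3                 # output channels, input channels, kernel height, kernel width
W = np.random.randn(N, C, R, S)
W[np.abs(W) < 1.0] = 0.0                # pretend the tensor has been pruned

# Mode-1 matricization: keep the first dimension, flatten the rest,
# then store the (now mostly zero) matrix in compressed sparse row (CSR) form.
W1 = sp.csr_matrix(W.reshape(N, C * R * S))
print(W1.shape, W1.nnz)                 # (N, C*R*S), only the surviving weights are stored
```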

-O(n, y, x) is then the dot-product between the nth row of W(1) and vec(I).

-Subsequently, the values at the same given (y, x)th position of all N output channels can be computed collectively as a sparse-matrix-dense-vector multiplication (SpMV). That is, for each output position (y, x), the results of all N output channels come from a single SpMV:

O(:, y, x) = W(1) · vec(I_(y,x))

-where I_(y,x) denotes the tensor I with its last two dimensions shifted by (y, x).

-we operate with a virtual dense matrix, I(virtual), whose columns are generated on the fly by adjusting the indices through which we access vec(I). The dense matrix is thus never stored; each column is produced on demand simply by shifting the read offsets into vec(I).

O(1) = W(1) · I(virtual), where the column of I(virtual) for output position (y, x) is vec(I_(y,x))
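
A minimal NumPy sketch of direct sparse convolution under these definitions, assuming stride 1 and no padding (the names and the explicit loops are for clarity, not SkimCaffe's optimized implementation):

```python
import numpy as np

def direct_sparse_conv(W, I):
    """Direct sparse convolution sketch (stride 1, no padding).

    W: (N, C, R, S) pruned weight tensor; I: (C, H_in, W_in) input.
    The im2col-style dense matrix is never built: each non-zero weight keeps a
    precomputed offset into the flattened input, and the 'virtual' column for
    output position (y, x) is reached by adding y*W_in + x to that offset.
    """
    N, C, R, S = W.shape
    Cin, H_in, W_in = I.shape
    assert C == Cin
    H_out, W_out = H_in - R + 1, W_in - S + 1

    vec_I = I.reshape(-1)                        # vec(I)
    O = np.zeros((N, H_out, W_out))

    for n in range(N):
        # Sparse row n of W(1): non-zero values plus their offsets into vec(I).
        c, r, s = np.nonzero(W[n])
        vals = W[n, c, r, s]
        offs = c * H_in * W_in + r * W_in + s    # position of (c, r, s) inside vec(I)
        for y in range(H_out):
            for x in range(W_out):
                # Dot product with the on-the-fly column vec(I_(y,x)).
                O[n, y, x] = np.dot(vals, vec_I[offs + y * W_in + x])
    return O
```

Stacking vec(I_(y,x)) for every (y, x) as columns would give the matrix I(virtual) above; the loop simply reads those columns straight out of vec(I) instead of materializing them.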

-(The following part, on applying the same sparse speedup to FC layers, is skipped in these notes.)

Performance model

-The performance of sparse convolution depends highly on the sparsity level of the weight tensor.

-The model determines the appropriate target sparsity range for pruning and projects the theoretical speedup for any given sparsity, using the roofline model.

-Let C be the number of FLOPs of the layer, S(A) the combined size of the input and output activation tensors (in bytes), and S(W) the size of W, all measured without considering sparsity.

-Let x be the density of non-zeros in the filters (the lower x, the higher the sparsity of the weight tensor), F the compute capability of the processor (in FLOP/s), and B the memory bandwidth (in B/s). Denote by t(dense) the time for dense convolution, by t(sparse_compute) the time for sparse convolution when it is compute bound, and by t(sparse_bw) the time when it is bandwidth bound. The theoretical speedup is then:

speedup = t(dense) / max(t(sparse_compute), t(sparse_bw)), with
t(dense) = max(C/F, (S(A) + S(W))/B), t(sparse_compute) = α·x·C/F, t(sparse_bw) = (S(A) + β·x·S(W))/B

-where α and β denote the compute and storage overheads of sparse representations, respectively.

-We observe α ∼ 3 on a Xeon E5-2697 v4 processor, and β is typically 2 (in compressed sparse row representation, we need 4B column index for each 4B single-precision floating point non-zero value)

-there is an upper bound of useful sparsity, and a sparsity higher than it does not provide additional speedup, while only making training more challenging to preserve accuracy.

-This upper bound can be found by solving for x such that t(sparse_compute) = t(sparse_bw) (e.g., the upper-bound sparsity for conv5 of AlexNet on the Xeon corresponds to x ∼ 0.02).

-The speedup-to-sparsity relation also varies over layers. For example, since the 1×1 convolutions in GoogLeNet have low arithmetic intensity to begin with, their performance quickly becomes bandwidth bound at lower sparsity (or higher x). Each layer thus has a different sensitivity to sparsity.

-There is also a lower bound of useful sparsity: below it, sparse convolution becomes slower than dense convolution, since t(sparse_compute) > t(dense) for x > 1/α.
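
A small sketch of this performance model, written out from the formulas above (the α and β defaults follow the Xeon numbers quoted earlier; everything else is a parameter you supply):

```python
def projected_speedup(C_flop, S_A, S_W, x, F, B, alpha=3.0, beta=2.0):
    """Roofline-style projection of sparse-over-dense convolution speedup.

    C_flop: FLOPs of the dense layer; S_A: bytes of input+output activations;
    S_W: bytes of the dense weights; x: density of non-zeros (0 < x <= 1);
    F: peak FLOP/s; B: memory bandwidth in bytes/s;
    alpha, beta: compute and storage overheads of the sparse representation.
    """
    t_dense = max(C_flop / F, (S_A + S_W) / B)   # dense conv is usually compute bound
    t_sparse_compute = alpha * x * C_flop / F
    t_sparse_bw = (S_A + beta * x * S_W) / B
    return t_dense / max(t_sparse_compute, t_sparse_bw)

def useful_density_range(C_flop, S_A, S_W, F, B, alpha=3.0, beta=2.0):
    """Densities between the two returned values are worth targeting: below the
    first, the layer becomes bandwidth bound (no further speedup); above 1/alpha,
    sparse convolution is slower than dense. Assumes the dense layer can become
    compute bound, i.e. alpha*C_flop/F > beta*S_W/B."""
    x_bw = (S_A / B) / (alpha * C_flop / F - beta * S_W / B)   # solve t(sparse_compute) = t(sparse_bw)
    return x_bw, 1.0 / alpha
```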

-The previous section described our sparse convolution implementation that achieves α = 3 (since α is the compute overhead, lower is better) on the Xeon, instead of α = 100 as conjectured by Szegedy et al. (2015).

GUIDED SPARSITY LEARNING (GSL)

-For example, layers like the first layer in AlexNet and GoogLeNet may not provide enough sparsity, regardless of the amount of regularization applied, as long as the original inference accuracy is to be preserved. Early features matter, so the first layer generally cannot be sparsified much.

-When GSL is used with element-wise regularization for pruning, thus denoted as Guided Element-wise Sparsity Learning (GESL), it learns the element-wise sparsity of layers where the model predicts speedups.

-GSL can also be used with regularization methods that are more complicated than basic ridge and lasso regularization.


-Although GSL as described above aims primarily at inference speed, it can balance the implications of pruning on inference speed, accuracy, and model size. To do this, optional constraints can be given to GSL to prioritize pruning of different layers in the network; for example, using different regularization strengths on conv and fc layers tunes the priority between speed and model size.
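
The paper does not spell GSL out as code in these notes, but a toy sketch of the "guided" decision, reusing projected_speedup from the model sketch above, could look like this (the layer statistics, density target, and per-layer regularization interface are illustrative assumptions, not the paper's exact procedure):

```python
def guide_pruning(layers, F, B, target_density=0.2, base_reg=1e-4,
                  alpha=3.0, beta=2.0):
    """Return a per-layer regularization strength: layers whose projected speedup
    at an achievable density exceeds 1 get sparsity-inducing regularization,
    the rest are left dense (strength 0).

    layers: dict mapping layer name -> (C_flop, S_A, S_W).
    """
    strengths = {}
    for name, (C_flop, S_A, S_W) in layers.items():
        s = projected_speedup(C_flop, S_A, S_W, target_density, F, B, alpha, beta)
        strengths[name] = base_reg if s > 1.0 else 0.0
    return strengths
```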

Experiments

-To be continued.
