
Notes on the ICLR 2017 paper: FASTER CNNS WITH DIRECT SPARSE CONVOLUTIONS AND GUIDED PRUNING

Authors: Jongsoo Park, Sheng Li, et al., Intel Labs

Abstract

  • The number of parameters needed in CNNs, however, are often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. CNNs are becoming ever more important, yet existing post-training pruning brings only limited benefit.
  • Previous algorithms prune the fc layers, while the conv layers, which matter most for speed, are left unpruned.
  • We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. The authors therefore propose a pruning algorithm that also targets the conv layers.
  • our success is an efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. This is an efficient multiplication of a sparse matrix with a dense matrix, applicable to convolutions between dense feature maps and sparse kernels with arbitrary sparsity patterns.
  • Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. A performance model is also proposed to predict the sweet-spot sparsity level for each layer on each architecture.
  • Together, these two allow us to demonstrate 3.1–7.3× convolution speedups over dense convolution in AlexNet. Pruning the conv layers of AlexNet this way yields 3.1–7.3× speedups over dense convolution.
  • Code: https://github.com/IntelLabs/SkimCaffe.

Introduction

  • Post-processing may consist of retraining with sparsity inducing regularization or of approximating tensors of parameters via tensor factorization. In Han et al. (2015; 2016b) the post-processing is sparsity-inducing retraining and tensor factorization, but fast inference with the pruned conv layers is not addressed.
  • The crux of speed improvement thus lies in actual fast convolution of sparse kernels with feature maps. The key to conv-layer speedup is an efficient convolution between dense feature maps and sparse kernels.
  • the performance of sparse matrix operations is typically memory bandwidth bound.
  • (Lebedev & Lempitsky, 2015; Wen et al., 2016) proposed group-wise sparsity patterns.
  • Convolutions in CNNs involve multiple channels and thus offer much higher data reuse than typical sparse matrix operations in scientific computing. Unlike the earlier papers, the authors look at conv compression from this angle: convolution inherently involves a large amount of data reuse.
  • Specifically, we present a highly efficient direct sparse convolution design formulated as sparse-matrix-dense-matrix multiplication with the dense matrix columns generated on-the-fly from a single column vector. The authors propose an efficient sparse-dense multiplication in which the columns of the dense matrix are generated on the fly from a single column vector.
  • we formulate a performance model to elucidate when and how best to use sparse convolutions on different computer architectures and at different CNN layers. In addition, a model is formulated to predict the most useful sparsity level.
  • The proposed algorithm is called Guided Sparsity Learning (GSL).
  • In some cases, particular layers are identified as best not pruned at all due to no potential speedups, leaving them unchanged gives other layers more room for gainful pruning in terms of size and speed. In some cases a layer is predicted to be best left unpruned, which in turn gives other layers more room for gainful pruning in terms of size and speed.

Contributions

  • A high performance sparse convolution design that takes advantage of arbitrary sparsity patterns and outperforms dense convolution even with a moderate sparsity.
  • A general performance model that (1) projects speedups over dense convolutions on varying level/types of sparsity and different computing platforms and (2) provides training guidelines for precisely targeting layers and sparsity ranges with potential to accelerate inference.

    -Guided Sparsity Learning (GSL), the first pruning algorithm fusing the awareness of speedup potential into sparsity learning; and its application to AlexNet and GoogLeNet. In particular, in GoogLeNet, we prune out more than 80% of parameters of all 5×5/3×3 conv layers and fc layers with no accuracy drop.

    -An optimized sparse convolution implementation (http://github.com/IntelLabs/SkimCaffe) that provides 7.3×, 3.4×, 3.1× speedups of convolution layers in AlexNet over dense methods on Intel Atom, Xeon, and Knights Landing processors, respectively, with no accuracy drop. In particular, this paper is one of the first evaluations of Xeon Phi processors on deep learning algorithms.

    All of these points were mentioned above.

Algorithm overview

DIRECT SPARSE CONVOLUTION

-A sparse convolution for the all output positions across all output channels can be eventually considered as a virtual sparse-matrix-dense-matrix multiplication. Sparse convolution over all output positions and all output channels can be viewed as a virtual sparse-matrix-dense-matrix multiplication.

-Suppose there are N filters, each of size R × S over C input channels. The parameters of the layer then form an N × C × R × S tensor, the input is a C × H(in) × W(in) tensor, and the output is an N × H(out) × W(out) tensor. Each output value is the dot product of two 3D tensors (a filter and the input patch at that position), as follows:

O(n, y, x) = Σ_c Σ_r Σ_s W(n, c, r, s) · I(c, y + r, x + s)
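To make the indexing concrete, here is a minimal NumPy sketch of this dense direct convolution (stride 1 and no padding assumed; the function and variable names are mine, not from the paper or SkimCaffe):

```python
import numpy as np

def direct_conv_dense(W, I):
    """Dense direct convolution, stride 1, no padding.
    W: (N, C, R, S) filter tensor; I: (C, H_in, W_in) input tensor.
    Returns O of shape (N, H_out, W_out)."""
    N, C, R, S = W.shape
    _, H_in, W_in = I.shape
    H_out, W_out = H_in - R + 1, W_in - S + 1
    O = np.zeros((N, H_out, W_out))
    for n in range(N):
        for y in range(H_out):
            for x in range(W_out):
                # O(n, y, x): dot product of the n-th filter (a C x R x S tensor)
                # with the C x R x S input patch anchored at (y, x)
                O[n, y, x] = np.sum(W[n] * I[:, y:y+R, x:x+S])
    return O
```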

-This can be seen as first vectorizing the 3D subtensor of W corresponding to the nth output channel, then vectorizing I (denoted as vec(I)), and finally stretching the first vector to match the dimension of the second. When W is sparse, this becomes a sparse-vector-dense-vector dot product.

[Equation: O(n, y, x) written as the dot product of the stretched vec(W(n, :, :, :)) with vec(I)]

- Consider flattening dimensions except the first one of W into a sparse matrix W(1) (i.e. mode-1 matricization of W as in Kolda & Bader (2009)), with its row vectors stretched to match the dimension of vec(I). That is, W is flattened along all dimensions except the first into a sparse matrix W(1), whose row vectors are stretched to match the dimension of vec(I).

-O(n, y, x) is then the dot-product between the nth row of W(1) and vec(I).

-Subsequently, the values at the same given (y, x)th position of all N output channels can be computed collectively as a sparse-matrix-dense-vector multiplication (SpMV). The convolution results at each position (y, x), across all N output channels, are thus obtained with a single SpMV:

O(:, y, x) = W(1) · vec(I(y,x))

-where I(y,x) denotes the tensor I with its last two dimensions shifted by (y, x), and vec(I(y,x)) is its vectorization.

-we operate with a virtual dense matrix, I(virtual), where its columns are generated on the fly by adjusting indices through which we access vec(I). Rather than materializing every shifted copy, a virtual dense matrix I(virtual) is defined whose columns are generated on the fly simply by adjusting the indices used to access vec(I).

O(1) = W(1) · I(virtual), a sparse-matrix-dense-matrix multiplication in which each column of I(virtual) is vec(I(y,x)) for one output position (y, x)
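Below is a minimal Python sketch of the resulting direct sparse convolution: W(1) is kept in a CSR-like form, and each column of I(virtual) is obtained by adding the per-position shift y·W(in) + x to offsets precomputed from (c, r, s). This only illustrates the indexing trick under the same assumptions as above (stride 1, no padding, names of my choosing); it is not the optimized SkimCaffe kernel:

```python
import numpy as np

def direct_conv_sparse(W, I):
    """Direct sparse convolution (stride 1, no padding), computed as W(1) times
    a virtual dense matrix whose columns are vec(I) read through shifted indices."""
    N, C, R, S = W.shape
    _, H_in, W_in = I.shape
    H_out, W_out = H_in - R + 1, W_in - S + 1
    vec_I = I.ravel()  # vec(I), flattened in (c, y, x) order

    # CSR-like rows of W(1): for each output channel n, store the non-zero values
    # together with a precomputed offset into vec(I) instead of the index (c, r, s).
    rows = []
    for n in range(N):
        nz = []
        for (c, r, s), w in np.ndenumerate(W[n]):
            if w != 0.0:
                nz.append((w, c * H_in * W_in + r * W_in + s))
        rows.append(nz)

    O = np.zeros((N, H_out, W_out))
    for n in range(N):
        for y in range(H_out):
            for x in range(W_out):
                shift = y * W_in + x  # turns vec(I) into column (y, x) of I(virtual)
                O[n, y, x] = sum(w * vec_I[off + shift] for w, off in rows[n])
    return O
```

The only per-position work beyond the non-zero multiplications is computing the shift, which is why arbitrary sparsity patterns can be exploited without materializing the dense matrix.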

-The following paragraph, on sparse speedup of the fc layers, is skipped here.

Performance model

-The performance of sparse convolution depends highly on the sparsity level of the weight tensor.

-The roofline model is used to determine the appropriate target sparsity range for pruning and to project the theoretical speedup for any given sparsity.

-Let the number of FLOPs be C, the combined size of the input and output activation tensors be S(A) (in bytes), and the size of W be S(W), all counted without considering sparsity.

-We denote the density of non-zeros in the filters as x (the lower the x, the higher the sparsity of the weight tensor), the compute capability of the processor as F (in FLOP/s), and the memory bandwidth as B (in B/s). Let t(dense) be the time for dense convolution, t(sparse_compute) the time for sparse convolution when bound by compute, and t(sparse_bw) the time when bound by bandwidth. The theoretical speedup is then given by the following expressions:

t(dense) = max( C/F, (S(A) + S(W))/B )
t(sparse_compute) = α·x·C/F
t(sparse_bw) = (S(A) + β·x·S(W))/B
speedup = t(dense) / max( t(sparse_compute), t(sparse_bw) )

-where α and β denote the compute and storage overheads of sparse representations, respectively.

-We observe α ∼ 3 on a Xeon E5-2697 v4 processor, and β is typically 2 (in compressed sparse row representation, we need a 4B column index for each 4B single-precision floating-point non-zero value).

-There is an upper bound of useful sparsity: a sparsity higher than that does not provide additional speedup, while only making training more challenging to preserve accuracy.

-This upper bound can be found by solving for x such that t(sparse_compute) = t(sparse_bw) (e.g., the upper-bound sparsity for conv5 of AlexNet on the Xeon is at x ∼ 0.02).

-The speedup-to-sparsity relation also varies over layers. For example, since 1×1 convolutions in GoogLeNet have low arithmetic intensity to begin with, their performance quickly becomes bandwidth bound at lower sparsity (or higher x). Each layer thus has its own sensitivity to sparsity.

-Since t(sparse_compute) > t(dense) for x > 1/α, there is also a lower bound of useful sparsity: below that sparsity (i.e., at higher density x), sparse convolution becomes slower than dense convolution.

-The previous section described our sparse convolution implementation that achieves α = 3 (since α is the compute overhead, lower is better) on the Xeon, instead of the α = 100 conjectured by Szegedy et al. (2015).
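As a worked version of the model above, the sketch below computes the projected speedup and the useful density range from C, S(A), S(W), F, B, α, and β. The formulas are my reconstruction from the definitions in this section, not the paper's verbatim equations:

```python
def project_speedup(x, C, S_A, S_W, F, B, alpha=3.0, beta=2.0):
    """Projected speedup of sparse over dense convolution at non-zero density x
    (the lower x, the sparser the weight tensor)."""
    t_dense = max(C / F, (S_A + S_W) / B)
    t_sparse_compute = alpha * x * C / F        # sparse conv when compute bound
    t_sparse_bw = (S_A + beta * x * S_W) / B    # sparse conv when bandwidth bound
    return t_dense / max(t_sparse_compute, t_sparse_bw)

def useful_density_range(C, S_A, S_W, F, B, alpha=3.0, beta=2.0):
    """Range of non-zero densities x where sparsity is 'useful':
    below x_lo the kernel is bandwidth bound and extra sparsity is wasted
    (t(sparse_compute) = t(sparse_bw) at x_lo); above x_hi = 1/alpha, sparse
    convolution is slower than compute-bound dense convolution."""
    x_lo = S_A / (alpha * C * B / F - beta * S_W)
    x_hi = 1.0 / alpha
    return x_lo, x_hi
```

With α ∼ 3, for instance, x_hi = 1/3: the weights must already be more than about two-thirds zero before the sparse kernel can beat dense convolution at all.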

GUIDED SPARSITY LEARNING (GSL)

-For example, layers like the first layer in AlexNet and GoogLeNet may not provide enough sparsity regardless of the amount of regularization applied, as long as the original inference accuracy is to be preserved. Early-layer features are important, so the first layer generally cannot be sparsified much.

-When GSL is used with element-wise regularization for pruning, thus denoted as Guided Element-wise Sparsity Learning (GESL), it learns the element-wise sparsity of layers where the model predicts speedups.

-GSL can also be used with regularization methods that are more complicated than basic ridge and lasso regularization.


-Although GSL as described above aims primarily at inference speed, GSL can balance the implications of pruning on inference speed, accuracy, and model size. To do this, optional constraints can be given to GSL to prioritize pruning of different layers in the network. For example, by using different regularization strengths on conv and fc, we can tune the priorities on speed and model size. In other words, although GSL targets forward-pass speed first, it can also balance the speed/accuracy/model-size trade-offs that pruning introduces by setting such constraints in advance, e.g. different regularization strengths for conv and fc layers; a rough sketch of this guided decision follows below.
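As a rough illustration of that guidance (my own simplification, not the authors' GSL training code), the per-layer projections from the performance model could be turned into a pruning plan along these lines; layer names, knob names, and default strengths are all hypothetical:

```python
def guided_sparsity_plan(projected_speedup, min_speedup=1.0,
                         conv_strength=1e-4, fc_strength=1e-3):
    """projected_speedup: dict mapping layer name -> best speedup the performance
    model projects within that layer's useful sparsity range.
    Layers with no speedup potential are left dense, which keeps the accuracy
    budget for layers where pruning actually pays off."""
    plan = {}
    for name, speedup in projected_speedup.items():
        if speedup <= min_speedup:
            plan[name] = {"prune": False}  # e.g. a first conv layer that never gets sparse enough
        else:
            # hypothetical priority knob: stronger regularization on fc layers when
            # model size matters more, weaker on conv layers when speed matters more
            strength = fc_strength if name.startswith("fc") else conv_strength
            plan[name] = {"prune": True, "l1_strength": strength}
    return plan
```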

Experiments

-To be continued.
