In the previous article we implemented the most basic convolution with mkldnn.
The article Understanding Memory Formats in the mkldnn documentation describes the basic layouts and types a memory object's data can have in memory.
The data of the memory object we defined with memory::format_tag::nhwc is laid out in memory as shown in the figure below.
However, mkldnn recommends creating memory objects with a blocked layout instead, which reportedly makes it easier to accelerate the computation with SIMD instructions such as SSE and AVX.
The output of the previous convolution program (this trace is printed when the environment variable MKLDNN_VERBOSE=1 is set) indeed shows that the avx2 instructions were not used:
mkldnn_verbose,exec,cpu,convolution,ref:any,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,alg:convolution_direct,mb1_ic2oc2_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.526223
Below we look into fast convolution; the official mkl-dnn reference example is cnn_inference_f32.cpp
First, let's review the convolution code from the previous article:
// source memory format descriptor, NHWC
auto user_src3_md = memory::desc(
conv3_src_tz, // logical dims, the order is defined by a primitive
memory::data_type::f32, // tensor's data type
memory::format_tag::nhwc // memory format, NHWC in this case
);
// weights memory format descriptor, OIHW
auto user_conv3_weights_md = memory::desc(
conv3_weights_tz, memory::data_type::f32,
memory::format_tag::oihw // memory format, OIHW in this case
);
// bias memory descriptor: a 1D array, tag x
auto user_conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::x);
// destination memory format descriptor, NHWC
auto user_dst3_md = memory::desc(
conv3_dst_tz, // logical dims, the order is defined by a primitive
memory::data_type::f32, // tensor's data type
memory::format_tag::nhwc // memory format, NHWC in this case
);
// create user memory
auto user_conv3_src_mem = memory(user_src3_md, cpu_engine, image.data());
auto user_conv3_weights_mem = memory(user_conv3_weights_md, cpu_engine, weights.data());
auto user_conv3_bias_mem = memory(user_conv3_bias_md, cpu_engine, bias.data());
// For dst_mem the library allocates buffer
auto user_conv3_dst_mem = memory(user_dst3_md, cpu_engine); //for conv output
auto user_conv3_dst1_mem = memory(user_dst3_md, cpu_engine); //for conv output
//Convolution descriptor: the memory descriptors passed in fix the format of each memory object
//src->NHWC weights->OIHW bias->x dst->NHWC
//so the convolution must compute with exactly these memory layouts
auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
user_conv3_bias_md,
user_dst3_md, conv3_strides, conv3_padding,
conv3_padding);
auto conv3_pd = convolution_forward::primitive_desc(conv3_d, cpu_engine);
// create convolution primitive and add it to net
auto conv3 = convolution_forward(conv3_pd);
conv3.execute(
cpu_stream,
{
{ MKLDNN_ARG_SRC, user_conv3_src_mem },
{ MKLDNN_ARG_WEIGHTS, user_conv3_weights_mem },
{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
{ MKLDNN_ARG_DST, user_conv3_dst_mem }
}
);
The key part is this call:
auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
user_conv3_bias_md,
user_dst3_md, conv3_strides, conv3_padding,
conv3_padding);
The convolution descriptor's arguments fix the format of every memory object involved: src->NHWC, weights->OIHW, bias->x, dst->NHWC.
The convolution then has to compute strictly in these layouts, which takes away the library's freedom to pick a faster one.
The official mkl-dnn example cnn_inference_f32.cpp does this differently. Summarizing its code:
//Declare the memory descriptors the convolution uses. The difference: every format is memory::format_tag::any,
//i.e. we do not fix the layouts ourselves but let mkldnn pick the best layout at runtime for the current CPU hardware.
auto conv3_src_md = memory::desc({ conv3_src_tz }, memory::data_type::f32, memory::format_tag::any);
auto conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::any);
auto conv3_weights_md = memory::desc({ conv3_weights_tz }, memory::data_type::f32, memory::format_tag::any);
auto conv3_dst_md = memory::desc({ conv3_dst_tz }, memory::data_type::f32, memory::format_tag::any);
//create the convolution descriptor
auto conv3_fast_desc = convolution_forward::desc(prop_kind::forward_inference,
algorithm::convolution_direct, conv3_src_md, conv3_weights_md,
conv3_bias_md, conv3_dst_md, conv3_strides, conv3_padding,
conv3_padding);
//create the convolution primitive descriptor
auto conv3_fast_prim_desc = convolution_forward::primitive_desc(conv3_fast_desc, cpu_engine);
//Query the primitive descriptor for the layouts it chose for the source and weights memory;
//if a layout differs from our own memory object's, reorder the data into it
auto conv3_src_memory = user_conv3_src_mem;
if (conv3_fast_prim_desc.src_desc() != user_conv3_src_mem.get_desc()) {
conv3_src_memory = memory(conv3_fast_prim_desc.src_desc(), cpu_engine);
reorder(user_conv3_src_mem, conv3_src_memory)
.execute(cpu_stream, user_conv3_src_mem, conv3_src_memory);
}
auto conv3_weights_memory = user_conv3_weights_mem;
if (conv3_fast_prim_desc.weights_desc() != user_conv3_weights_mem.get_desc()) {
conv3_weights_memory = memory(conv3_fast_prim_desc.weights_desc(), cpu_engine);
reorder(user_conv3_weights_mem, conv3_weights_memory)
.execute(cpu_stream, user_conv3_weights_mem, conv3_weights_memory);
}
//Query the primitive descriptor for the destination layout it chose and create the destination memory object
auto conv3_dst_memory = memory(conv3_fast_prim_desc.dst_desc(), cpu_engine);
auto fast_conv3 = convolution_forward(conv3_fast_prim_desc);
fast_conv3.execute(
cpu_stream,
{
{ MKLDNN_ARG_SRC, conv3_src_memory },
{ MKLDNN_ARG_WEIGHTS, conv3_weights_memory },
{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
{ MKLDNN_ARG_DST, conv3_dst_memory }
}
);
//Reorder the result from the layout mkldnn chose back into our own destination memory object.
//conv3_dst_memory is a separate, freshly allocated buffer in the library-chosen layout, so this
//step is always needed; it degenerates to a plain copy when the two layouts happen to match.
reorder(conv3_dst_memory, user_conv3_dst1_mem)
.execute(cpu_stream, conv3_dst_memory, user_conv3_dst1_mem);
// wait for all operations to finish
cpu_stream.wait();
As you can see, the new sequence does not hard-code the data format of the memory objects the convolution needs; mkldnn chooses the formats itself based on the current hardware, and the program reorders data into whatever best layout mkldnn reports, so the computation can exploit the hardware fully. We trade the time of three reorders for the best compute efficiency (two reorders before the computation and one after; the bias is a 1D array, so it actually never needs reordering).
Let's compare the performance of the fast convolution against the plain one.
For the previous program's case (source NCHW 1x2x5x5, kernel 2x2x3x3):
Plain convolution: 0.504342 ms
Fast convolution: 0.00401139 + 0.0196924 + 0.0324559 + 0.016775 = 0.07293469 ms
That is roughly a 7x speedup. The program also prints the results of both computations, and the data is exactly the same.
Now let's look at a larger convolution closer to a real workload, with input sizes:
const int N = 1, H = 64, W = 64, C = 64;
const int IC = C, OC = IC, KH = 3, KW = 3;
Plain convolution: 1120.01 ms
Fast convolution: 0.393482 + 0.0729345 + 1.83284 + 0.652764 = 2.9520205 ms
We can see that the convolution now runs with jit:avx2:
mkldnn_verbose,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:aBcd8b:f0 wei_f32::blocked:ABcd8b8a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd8b:f0,alg:convolution_direct,mb1_ic64oc64_ih64oh64kh3sh1dh0ph1_iw64ow64kw3sw1dw0pw1,1.83284
Performance improved by a factor of about 379. Awesome!!!
Finally, the full code, for reference:
https://github.com/tisandman555/mkldnn_study/blob/master/fast_conv.cpp