
MKL-DNN Study Notes (5): Implementing Fast Computation of the Conv Layer

In the previous post we implemented code that uses mkldnn to perform the most basic convolution computation.

The article Understanding Memory Formats in the mkldnn documentation describes the basic memory layouts and types available for memory objects.

The memory objects we have defined so far use memory::format_tag::nhwc, whose data is laid out in memory as shown in the figure below.

[Figure: NHWC memory layout]

However, mkldnn recommends creating memory objects based on a blocked layout instead, which is reportedly better suited to accelerating the computation with SIMD instructions such as SSE and AVX.

[Figure: blocked memory layout]
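To make the difference concrete, here is a small sketch of my own (the function names are made up, purely illustrative) showing how the linear offset of element (n, c, h, w) is computed in a plain NHWC buffer versus an nChw8c blocked buffer, where each group of 8 channel values is stored contiguously so a single SIMD register can load them in one go:

	#include <cstddef>

	// Offset of element (n, c, h, w) in a plain NHWC buffer
	size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
		size_t C, size_t H, size_t W) {
		return ((n * H + h) * W + w) * C + c;
	}

	// Offset of the same element in the nChw8c blocked layout:
	// channels are split into blocks of 8, and the 8 channel values
	// of a block at a given (h, w) sit next to each other in memory
	size_t offset_nChw8c(size_t n, size_t c, size_t h, size_t w,
		size_t C, size_t H, size_t W) {
		const size_t blk = 8;
		const size_t Cb = (C + blk - 1) / blk; // number of channel blocks
		return (((n * Cb + c / blk) * H + h) * W + w) * blk + c % blk;
	}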

Looking at the verbose output of the previous convolution program (these mkldnn_verbose lines are printed when the MKLDNN_VERBOSE environment variable is set to 1), it indeed never used AVX2 instructions: the implementation is reported as ref:any, the plain reference kernel.

mkldnn_verbose,exec,cpu,convolution,ref:any,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,alg:convolution_direct,mb1_ic2oc2_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.526223


Now let's look into fast convolution computation. The official mkl-dnn reference example is cnn_inference_f32.cpp.

First, let's review the convolution code from the earlier post:

// Source memory descriptor, NHWC
	auto user_src3_md = memory::desc(
		conv3_src_tz, // logical dims, the order is defined by a primitive
		memory::data_type::f32,     // tensor's data type
		memory::format_tag::nhwc    // memory format, NHWC in this case
	);

	// Weights memory descriptor, OIHW
	auto user_conv3_weights_md = memory::desc(
		conv3_weights_tz, memory::data_type::f32,
		memory::format_tag::oihw
	);
    // Bias memory descriptor: a 1-D array, tag x
	auto user_conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::x);

    // Destination memory descriptor, NHWC
	auto user_dst3_md = memory::desc(
		conv3_dst_tz, // logical dims, the order is defined by a primitive
		memory::data_type::f32,     // tensor's data type
		memory::format_tag::nhwc    // memory format, NHWC in this case
	);

	// create user memory
	auto user_conv3_src_mem = memory(user_src3_md, cpu_engine, image.data());
	auto user_conv3_weights_mem = memory(user_conv3_weights_md, cpu_engine, weights.data());
	auto user_conv3_bias_mem = memory(user_conv3_bias_md, cpu_engine, bias.data());
	// For dst_mem the library allocates the buffer itself
	auto user_conv3_dst_mem = memory(user_dst3_md, cpu_engine);  // for the plain conv output
	auto user_conv3_dst1_mem = memory(user_dst3_md, cpu_engine); // for the fast conv output (compared later)

    //Convolution descriptor; the memory descriptors passed in fix the format of every input:
    //src->NHWC weights->OIHW bias->x dst->NHWC
    //so the final convolution must operate on exactly these memory layouts
	auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
		algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
		user_conv3_bias_md,
		user_dst3_md, conv3_strides, conv3_padding,
		conv3_padding);
	auto conv3_pd = convolution_forward::primitive_desc(conv3_d, cpu_engine);


	// create convolution primitive and add it to net
	auto conv3 = convolution_forward(conv3_pd);

	conv3.execute(
		cpu_stream,
		{
			{ MKLDNN_ARG_SRC, user_conv3_src_mem },
			{ MKLDNN_ARG_WEIGHTS, user_conv3_weights_mem },
			{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
			{ MKLDNN_ARG_DST, user_conv3_dst_mem }
		}
	);
           

The key part is this call:

    auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
        algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
        user_conv3_bias_md,
        user_dst3_md, conv3_strides, conv3_padding,
        conv3_padding);

    This convolution descriptor's arguments pin down the format of every memory object involved: src->NHWC, weights->OIHW, bias->x, dst->NHWC.

    The final convolution therefore has to operate strictly on these memory layouts, which robs the MKLDNN library of any flexibility in choosing how to compute.

The official mkl-dnn example cnn_inference_f32.cpp does this somewhat differently. Summarizing its code:

//Declare the memory descriptors used by the convolution; the difference is that every format is memory::format_tag::any
    //i.e. we do not specify the layout of any memory object ourselves, but let mkldnn determine the optimal layout at runtime from the characteristics of the current CPU
	auto conv3_src_md = memory::desc({ conv3_src_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_weights_md = memory::desc({ conv3_weights_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_dst_md = memory::desc({ conv3_dst_tz }, memory::data_type::f32, memory::format_tag::any);

	//Create the convolution descriptor
	auto conv3_fast_desc = convolution_forward::desc(prop_kind::forward_inference,
		algorithm::convolution_direct, conv3_src_md, conv3_weights_md,
		conv3_bias_md, conv3_dst_md, conv3_strides, conv3_padding,
		conv3_padding);

	//Create the convolution primitive descriptor
	auto conv3_fast_prim_desc = convolution_forward::primitive_desc(conv3_fast_desc, cpu_engine);

	//Query the primitive descriptor for the optimal source and weights layouts; if either differs from the format of the memory object we created ourselves, reorder the data
	auto conv3_src_memory = user_conv3_src_mem;
	if (conv3_fast_prim_desc.src_desc() != user_conv3_src_mem.get_desc()) {
		conv3_src_memory = memory(conv3_fast_prim_desc.src_desc(), cpu_engine);
		reorder(user_conv3_src_mem, conv3_src_memory)
			.execute(cpu_stream, user_conv3_src_mem, conv3_src_memory);
	}
	auto conv3_weights_memory = user_conv3_weights_mem;
	if (conv3_fast_prim_desc.weights_desc() != user_conv3_weights_mem.get_desc()) {
		conv3_weights_memory = memory(conv3_fast_prim_desc.weights_desc(), cpu_engine);
		reorder(user_conv3_weights_mem, conv3_weights_memory)
			.execute(cpu_stream, user_conv3_weights_mem, conv3_weights_memory);
	}

	//Query the primitive descriptor for the optimal destination layout and create the destination memory object with it
	auto conv3_dst_memory = memory(conv3_fast_prim_desc.dst_desc(), cpu_engine);
	//[Create memory for output]
	auto fast_conv3 = convolution_forward(conv3_fast_prim_desc);

	fast_conv3.execute(
		cpu_stream,
		{
			{ MKLDNN_ARG_SRC, conv3_src_memory },
			{ MKLDNN_ARG_WEIGHTS, conv3_weights_memory },
			{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
			{ MKLDNN_ARG_DST, conv3_dst_memory }
		}
	);

    //If the destination memory mkldnn produced does not match our own destination memory format, do a reorder
	if (conv3_dst_memory != user_conv3_dst1_mem) {
		reorder(conv3_dst_memory, user_conv3_dst1_mem)
			.execute(cpu_stream, conv3_dst_memory, user_conv3_dst1_mem);
	}

	// Wait for all operations to finish
	cpu_stream.wait();
           

As you can see, the new sequence does not hard-code the data formats of the memory objects the Conv operation consumes; mkldnn picks them itself according to the capabilities of the current hardware, and the program reorders data to match the optimal formats mkldnn reports. The computation can then exploit the hardware fully: we pay for up to 3 reorders in exchange for the best compute efficiency (2 reorders before the computation and 1 after; the bias is a 1-D array, so it never actually needs reordering). In a full network the overhead is smaller still, since the weight reorder can be done once up front and reused across inferences.
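Since the query-then-reorder stanza repeats for every tensor, it can be handy to factor it out. Here is a minimal helper sketch of my own (the name reorder_if_needed is made up; it assumes the same mkldnn C++ API and using-declarations as the code above):

	memory reorder_if_needed(const memory::desc &preferred, memory user_mem,
		engine &eng, stream &s) {
		// Layouts already match: hand the user's memory straight back
		if (preferred == user_mem.get_desc())
			return user_mem;
		// Otherwise allocate memory in the preferred layout and reorder into it
		auto tuned = memory(preferred, eng);
		reorder(user_mem, tuned).execute(s, user_mem, tuned);
		return tuned;
	}

With it, each of the three blocks above collapses to a single call such as conv3_src_memory = reorder_if_needed(conv3_fast_prim_desc.src_desc(), user_conv3_src_mem, cpu_engine, cpu_stream);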

Now let's compare the performance of the fast convolution against the plain one.

This is for the previous program's case, with an NCHW source of 1x2x5x5 and a 2x2x3x3 kernel:

[Screenshot: verbose timing output for the small convolution]

Plain convolution: 0.504342 ms

Fast convolution: 0.00401139 + 0.0196924 + 0.0324559 + 0.016775 = 0.07293469 ms (the three reorders plus the convolution itself)

That's roughly a 7x speedup. The program then prints the results of both computations, and the data is exactly the same.
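A check like that can be as small as the following sketch (my own helper, not from the post's code; it assumes mkldnn.hpp is included with using namespace mkldnn, and that both outputs are f32 in the same user format):

	#include <cmath>

	// Compare two f32 memory objects element by element with a small tolerance
	bool outputs_match(memory &a, memory &b, size_t count) {
		const float *pa = static_cast<const float *>(a.get_data_handle());
		const float *pb = static_cast<const float *>(b.get_data_handle());
		for (size_t i = 0; i < count; ++i)
			if (std::fabs(pa[i] - pb[i]) > 1e-6f)
				return false;
		return true;
	}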

Next, let's look at the numbers for a large convolution closer to a real-world case.

The input dimensions are:

const int N = 1, H = 64, W = 64, C = 64;

const int IC = C, OC = IC, KH = 3, KW = 3;
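For scale: with stride 1 and padding 1 the output stays 64x64 (the verbose line below confirms oh64/ow64), so this convolution performs about N x OC x OH x OW x IC x KH x KW = 1 x 64 x 64 x 64 x 64 x 3 x 3 ≈ 1.51 x 10^8 multiply-accumulates, some five orders of magnitude more work than the toy 5x5 case, which is why kernel quality matters so much here.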

[Screenshot: verbose timing output for the large convolution]

Plain convolution: 1120.01 ms

Fast convolution: 0.393482 + 0.0729345 + 1.83284 + 0.652764 = 2.9520205 ms

You can see that the convolution now runs with jit:avx2, and that the tensors use blocked layouts: src/dst are aBcd8b (i.e. nChw8c, channels blocked by 8) and the weights are ABcd8b8a (OIhw8i8o):

mkldnn_verbose,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:aBcd8b:f0 wei_f32::blocked:ABcd8b8a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd8b:f0,alg:convolution_direct,mb1_ic64oc64_ih64oh64kh3sh1dh0ph1_iw64ow64kw3sw1dw0pw1,1.83284

That is a relative performance improvement of about 379x. Sweet!!!

Finally, here is the code, for reference only:

https://github.com/tisandman555/mkldnn_study/blob/master/fast_conv.cpp
