
MKL-DNN Study Notes (5): Implementing Fast Computation of the Conv Layer

In the previous post we implemented the code for the most basic convolution computation with mkldnn.

The article Understanding Memory Formats in the mkldnn documentation describes the basic memory layouts and types that a memory object can have.

The data of the memory objects we defined with memory::format_tag::nhwc is laid out in memory as shown in the figure below.

[Figure: data layout of an nhwc memory object]

However, mkldnn recommends creating memory objects based on a blocked layout instead, which reportedly makes it easier to accelerate the computation with SIMD instructions such as SSE and AVX.

[Figure: blocked memory layout]
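To make the difference concrete, here is a small sketch of my own (not from the mkldnn docs) showing how the physical offset of a logical element (n, c, h, w) is computed in the plain nhwc layout versus the blocked nChw8c layout, where groups of 8 channels become the innermost dimension so that one 256-bit AVX register can load 8 adjacent channels at once:

	#include <cstddef>

	// Offset of logical element (n, c, h, w) in a plain NHWC buffer
	size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
		size_t C, size_t H, size_t W) {
		return ((n * H + h) * W + w) * C + c;
	}

	// Offset of the same element in the blocked nChw8c layout: channels are
	// split into blocks of 8 that form the innermost dimension, so 8
	// consecutive floats hold 8 adjacent channels
	size_t offset_nChw8c(size_t n, size_t c, size_t h, size_t w,
		size_t C, size_t H, size_t W) {
		size_t Cb = (C + 7) / 8;  // number of channel blocks
		return (((n * Cb + c / 8) * H + h) * W + w) * 8 + c % 8;
	}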

Indeed, the verbose log of the previous convolution run (enabled with the MKLDNN_VERBOSE=1 environment variable) confirms that avx2 instructions were not used; the implementation is the plain reference one, ref:any:

mkldnn_verbose,exec,cpu,convolution,ref:any,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,alg:convolution_direct,mb1_ic2oc2_ih5oh5kh3sh1dh0ph1_iw5ow5kw3sw1dw0pw1,0.526223

[Screenshot: output of the previous convolution program]

Now let's look into fast convolution. The official mkl-dnn reference example is cnn_inference_f32.cpp.

First, let's review the convolution code from the previous post:

	// Source memory descriptor, NHWC
	auto user_src3_md = memory::desc(
		conv3_src_tz,               // logical dims, the order is defined by a primitive
		memory::data_type::f32,     // tensor's data type
		memory::format_tag::nhwc    // memory format, NHWC in this case
	);

	// Weights memory descriptor, OIHW
	auto user_conv3_weights_md = memory::desc(
		conv3_weights_tz, memory::data_type::f32,
		memory::format_tag::oihw
	);

	// Bias memory descriptor: a 1D array, tag x
	auto user_conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::x);

	// Destination memory descriptor, NHWC
	auto user_dst3_md = memory::desc(
		conv3_dst_tz,               // logical dims, the order is defined by a primitive
		memory::data_type::f32,     // tensor's data type
		memory::format_tag::nhwc    // memory format, NHWC in this case
	);

	// create user memory
	auto user_conv3_src_mem = memory(user_src3_md, cpu_engine, image.data());
	auto user_conv3_weights_mem = memory(user_conv3_weights_md, cpu_engine, weights.data());
	auto user_conv3_bias_mem = memory(user_conv3_bias_md, cpu_engine, bias.data());
	// For dst_mem the library allocates the buffer
	auto user_conv3_dst_mem = memory(user_dst3_md, cpu_engine);   // for plain conv output
	auto user_conv3_dst1_mem = memory(user_dst3_md, cpu_engine);  // for fast conv output

	// Convolution operation descriptor. The parameters passed in fix the
	// format of every input memory object:
	//   src->NHWC  weights->OIHW  bias->x  dst->NHWC
	// so the convolution must be computed in exactly these layouts
	auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
		algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
		user_conv3_bias_md,
		user_dst3_md, conv3_strides, conv3_padding,
		conv3_padding);
	auto conv3_pd = convolution_forward::primitive_desc(conv3_d, cpu_engine);

	// create the convolution primitive and execute it
	auto conv3 = convolution_forward(conv3_pd);

	conv3.execute(
		cpu_stream,
		{
			{ MKLDNN_ARG_SRC, user_conv3_src_mem },
			{ MKLDNN_ARG_WEIGHTS, user_conv3_weights_mem },
			{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
			{ MKLDNN_ARG_DST, user_conv3_dst_mem }
		}
	);

The key part is this call:

	auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
		algorithm::convolution_direct, user_src3_md, user_conv3_weights_md,
		user_conv3_bias_md,
		user_dst3_md, conv3_strides, conv3_padding,
		conv3_padding);

This convolution descriptor pins down the format of every memory object involved: src->NHWC, weights->OIHW, bias->x, dst->NHWC.

So the convolution has to be computed strictly in those layouts, and the MKLDNN library loses all flexibility to pick something faster.

The official mkl-dnn example cnn_inference_f32.cpp does this differently. Condensing its code:

	// Declare the memory descriptors for the convolution, this time with every
	// format set to memory::format_tag::any: we do not specify the layouts
	// ourselves; mkldnn picks the optimal layout at runtime based on the
	// capabilities of the current CPU
	auto conv3_src_md = memory::desc({ conv3_src_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_bias_md = memory::desc({ conv3_bias_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_weights_md = memory::desc({ conv3_weights_tz }, memory::data_type::f32, memory::format_tag::any);
	auto conv3_dst_md = memory::desc({ conv3_dst_tz }, memory::data_type::f32, memory::format_tag::any);

	// Create the convolution operation descriptor
	auto conv3_fast_desc = convolution_forward::desc(prop_kind::forward_inference,
		algorithm::convolution_direct, conv3_src_md, conv3_weights_md,
		conv3_bias_md, conv3_dst_md, conv3_strides, conv3_padding,
		conv3_padding);

	// Create the convolution primitive descriptor
	auto conv3_fast_prim_desc = convolution_forward::primitive_desc(conv3_fast_desc, cpu_engine);

	// Query the primitive descriptor for the optimal source and weights
	// layouts; if they differ from the memory objects we created, reorder the data
	auto conv3_src_memory = user_conv3_src_mem;
	if (conv3_fast_prim_desc.src_desc() != user_conv3_src_mem.get_desc()) {
		conv3_src_memory = memory(conv3_fast_prim_desc.src_desc(), cpu_engine);
		reorder(user_conv3_src_mem, conv3_src_memory)
			.execute(cpu_stream, user_conv3_src_mem, conv3_src_memory);
	}
	auto conv3_weights_memory = user_conv3_weights_mem;
	if (conv3_fast_prim_desc.weights_desc() != user_conv3_weights_mem.get_desc()) {
		conv3_weights_memory = memory(conv3_fast_prim_desc.weights_desc(), cpu_engine);
		reorder(user_conv3_weights_mem, conv3_weights_memory)
			.execute(cpu_stream, user_conv3_weights_mem, conv3_weights_memory);
	}

	// Query the primitive descriptor for the optimal destination layout. If it
	// matches our own destination format, write into the user memory directly;
	// otherwise create an intermediate destination memory object
	auto conv3_dst_memory = user_conv3_dst1_mem;
	if (conv3_fast_prim_desc.dst_desc() != user_conv3_dst1_mem.get_desc())
		conv3_dst_memory = memory(conv3_fast_prim_desc.dst_desc(), cpu_engine);

	auto fast_conv3 = convolution_forward(conv3_fast_prim_desc);

	fast_conv3.execute(
		cpu_stream,
		{
			{ MKLDNN_ARG_SRC, conv3_src_memory },
			{ MKLDNN_ARG_WEIGHTS, conv3_weights_memory },
			{ MKLDNN_ARG_BIAS, user_conv3_bias_mem },
			{ MKLDNN_ARG_DST, conv3_dst_memory }
		}
	);

	// If mkldnn produced the destination in a different layout than ours,
	// reorder it back into our NHWC destination memory
	if (conv3_dst_memory != user_conv3_dst1_mem) {
		reorder(conv3_dst_memory, user_conv3_dst1_mem)
			.execute(cpu_stream, conv3_dst_memory, user_conv3_dst1_mem);
	}

	// Wait for all operations to finish
	cpu_stream.wait();
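Once the stream has finished, the result can be read back as plain floats. A minimal sketch continuing the listing above (it assumes <iostream> is included, as the full program prints its results anyway):

	// user_conv3_dst1_mem is NHWC, so its buffer can be consumed directly
	float *out = static_cast<float *>(user_conv3_dst1_mem.get_data_handle());
	std::cout << "first output value: " << out[0] << std::endl;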

As you can see, the new sequence no longer hard-codes the data formats of the memory objects the convolution needs. Instead, mkldnn chooses them itself based on the capabilities of the current hardware, and the program reorders the data into whatever optimal formats mkldnn reports, so the computation can fully exploit the hardware. We pay for up to three extra reorders in exchange for the best compute efficiency: two reorders before the computation and one after (the bias is a 1D array, so in practice it never needs a reorder).
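You can also check which layout mkldnn actually picked. A small sketch of my own, reusing conv3_fast_prim_desc and conv3_src_tz from above: since memory::desc supports equality comparison (the listing already relies on != for the reorder checks), comparing the chosen source descriptor against one built with an explicit tag tells you whether the blocked format was selected:

	auto chosen_src = conv3_fast_prim_desc.src_desc();
	auto blocked_src = memory::desc({ conv3_src_tz }, memory::data_type::f32,
		memory::format_tag::nChw8c);
	if (chosen_src == blocked_src)
		std::cout << "mkldnn chose the blocked nChw8c layout for src" << std::endl;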

Let's compare the performance of the fast convolution against the plain one.

For the case from the previous program, with a source NCHW of 1x2x5x5 and a 2x2x3x3 kernel:

[Screenshot: mkldnn_verbose timing output]

Plain convolution: 0.504342 ms

Fast convolution: 0.00401139 + 0.0196924 + 0.0324559 + 0.016775 = 0.07293469 ms

That's roughly a 7x speedup (0.504342 / 0.07293469 ≈ 6.9). The program also prints the results of both computations afterwards, and the data is exactly the same.
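The timings above come straight from the mkldnn_verbose log. If you would rather measure them yourself, a simple helper like the following works (my own sketch, not part of the original program):

	#include <chrono>

	// Time any callable and return the elapsed milliseconds
	template <typename F>
	double time_ms(F &&fn) {
		auto t0 = std::chrono::steady_clock::now();
		fn();
		auto t1 = std::chrono::steady_clock::now();
		return std::chrono::duration<double, std::milli>(t1 - t0).count();
	}

	// usage, e.g. around the fast convolution:
	// double ms = time_ms([&] {
	//     fast_conv3.execute(cpu_stream, { /* args as above */ });
	//     cpu_stream.wait();
	// });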

Now let's look at a larger convolution closer to a real-world workload.

The input dimensions are:

const int N = 1, H = 64, W = 64, C = 64;

const int IC = C, OC = IC, KH = 3, KW = 3;

[Screenshot: mkldnn_verbose timing output for the large convolution]

Plain convolution: 1120.01 ms

Fast convolution: 0.393482 + 0.0729345 + 1.83284 + 0.652764 = 2.9520205 ms

You can see that the convolution itself now runs with jit:avx2:

mkldnn_verbose,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:aBcd8b:f0 wei_f32::blocked:ABcd8b8a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd8b:f0,alg:convolution_direct,mb1_ic64oc64_ih64oh64kh3sh1dh0ph1_iw64ow64kw3sw1dw0pw1,1.83284
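As a sanity check (my own arithmetic, not from the original post): this convolution performs about N*OC*OH*OW*IC*KH*KW = 1*64*64*64*64*3*3 ≈ 1.5*10^8 multiply-accumulates, i.e. roughly 3*10^8 FLOPs. At 1.83284 ms that is about 165 GFLOP/s, a plausible figure for a single-core jit:avx2 kernel, whereas the 1120 ms reference kernel manages only about 0.27 GFLOP/s.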

That's a relative speedup of about 379x (1120.01 / 2.9520205). Awesome!!!

Finally, here is the full code, for reference:

https://github.com/tisandman555/mkldnn_study/blob/master/fast_conv.cpp
