【CUDA开发】 CUDA Thrust 规约求和

1. 使用 Thrust

Thrust 是一个开源的 C++ 库，用于开发高性能并行应用程序，以 C++ 标准模板库为蓝本实现。

官方文档见这里：CUDA Thrust

/* ... */

float *fMatrix_Device; // 指向设备显存

int iMatrixSize = iRow * iCol; // 矩阵元素个数

cudaMalloc((void**)&fMatrix_Device, iMatrixSize * sizeof(float)); // 在显存中为矩阵开辟空间

cudaMemcpy(fMatrix_Device, fMatrix_Host, iMatrixSize * sizeof(float), cudaMemcpyHostToDevice); // 将数据拷贝到显存

thrust::device_ptr<float> dev_ptr(fMatrix_Device);

float thrustResult = thrust::reduce(dev_ptr, dev_ptr + size_t(iMatrixSize), (float)0, thrust::plus<float>());

其中，fMatrix_Host 为指向主机内存的矩阵的头指针。

2. 我的 Reduction

/**

* 每个 warp 自动同步，不用 syncthreads();

* volatile : 加上关键字volatile的变量将被定义为敏感变量，意思是加了volatile

* 的变量在内存中的值可能会随时发生变化，当程序要去读取这个变量时，

必须要从内存中读取，而不是从缓存中读取

* sdata 数组头指针，数组位于共享内存

* tid 线程索引

/

device void warpReduce(volatile float sdata, int tid)

{

sdata[tid] += sdata[tid + 32];

sdata[tid] += sdata[tid + 16];

sdata[tid] += sdata[tid + 8];

sdata[tid] += sdata[tid + 4];

sdata[tid] += sdata[tid + 2];

sdata[tid] += sdata[tid + 1];

}

* 优化：解决了 reduce3 中存在的多余同步操作（每个warp默认自动同步）。

* globalInputData 输入数据，位于全局内存

* globalOutputData 输出数据，位于全局内存

global void reduce4(float globalInputData, float globalOutputData, unsigned int n)

shared__ float sdata[BLOCK_SIZE];

// 坐标索引

unsigned int tid = threadIdx.x;

unsigned int index = blockIdx.x(blockDim.x 2) + threadIdx.x;

unsigned int indexWithOffset = index + blockDim.x;

if (index >= n) sdata[tid] = 0;

else if (indexWithOffset >= n) sdata[tid] = globalInputData[index];

else sdata[tid] = globalInputData[index] + globalInputData[indexWithOffset];

syncthreads();

// 在共享内存中对每一个块进行规约计算

for (unsigned int s = blockDim.x / 2; s>32; s >>= 1)

{

if (tid < s) sdata[tid] += sdata[tid + s];

syncthreads();

}

if (tid < 32) warpReduce(sdata, tid);

// 把计算结果从共享内存写回全局内存

if (tid == 0) globalOutputData[blockIdx.x] = sdata[0];

* 计算 reduce4 函数的时间

* fMatrix_Host 矩阵头指针

* iRow 矩阵行数

* iCol 矩阵列数

* @return 和

float RuntimeOfReduce4(float fMatrix_Host, const int iRow, const int iCol)

float fReuslt = (float)malloc(sizeof(float));;

float fMatrix_Device; // 指向设备显存

int iMatrixSize = iRow * iCol; // 矩阵元素个数

cudaMalloc((void**)&fMatrix_Device, iMatrixSize * sizeof(float)); // 在显存中为矩阵开辟空间

cudaMemcpy(fMatrix_Device, fMatrix_Host, iMatrixSize * sizeof(float), cudaMemcpyHostToDevice); // 将数据拷贝到显存

/* ... /

for (int i = 1, int iNum = iMatrixSize; i < iMatrixSize; i = 2 i * BLOCK_SIZE)

int iBlockNum = (iNum + (2 * BLOCK_SIZE) - 1) / (2 * BLOCK_SIZE);

reduce4<<<iBlockNum, BLOCK_SIZE>>>(fMatrix_Device, fMatrix_Device, iNum);

iNum = iBlockNum;

cudaMemcpy(fReuslt, fMatrix_Device, sizeof(float), cudaMemcpyDeviceToHost); // 将数据拷贝到内存

cudaFree(fMatrix_Device);// 释放显存空间

return fReuslt[0];

上述程序是优化的最终版本，优化的主要内容包括：

1. 避免每个 Warp 中出现分支导致效率低下。

2. 减少取余操作。

3. 减小不必要的同步操作，每个warp都是默认同步的，不用额外的同步操作。

4. 减小线程的闲置，提高并行度

3. 时间对比

数据的大小为：

iRow = 1000;

iCol = 1000;

时间为：

ReduceThrust 的运行时间为：0.179968ms.

494497

Reduce0 的运行时间为：0.229152ms.

Reduce1 的运行时间为：0.134816ms.

Reduce2 的运行时间为：0.117504ms.

Reduce3 的运行时间为：0.086016ms.

Reduce4 的运行时间为：0.07424ms.

CPU的运行时间为：1 ms.

iRow = 2000;

iCol = 2000;

ReduceThrust 的运行时间为：0.282944ms.

1.97828e+006

Reduce0 的运行时间为：0.779776ms.

Reduce1 的运行时间为：0.42624ms.

Reduce2 的运行时间为：0.343744ms.

Reduce3 的运行时间为：0.217248ms.

Reduce4 的运行时间为：0.160416ms.

CPU的运行时间为：3 ms.

iRow = 4000;

iCol = 4000;

ReduceThrust 的运行时间为：0.536832ms.

7.91319e+006

Reduce0 的运行时间为：2.9919ms.

Reduce1 的运行时间为：1.56054ms.

Reduce2 的运行时间为：1.26618ms.

Reduce3 的运行时间为：0.726016ms.

Reduce4 的运行时间为：0.531712ms.

CPU的运行时间为：11 ms.

iRow = 6000;

iCol = 6000;

ReduceThrust 的运行时间为：0.988992ms.

1.7807e+007

Reduce4 的运行时间为：1.09286ms.

CPU的运行时间为：25 ms.

iRow = 11000;

iCol = 11000;

ReduceThrust 的运行时间为：2.9208ms.

5.98583e+007

Reduce4 的运行时间为：3.36998ms.

CPU的运行时间为：85 ms.

【CUDA开发】 CUDA Thrust 规约求和

1. 使用 Thrust

2. 我的 Reduction

上述程序是优化的最终版本，优化的主要内容包括：

3. 时间对比

继续阅读

5G小型蜂应用指南

UVA 116 Unidirectional TSP

PAT (Advanced Level) Practise 1012 The Best Rank (25)

mysql5.7的sql优化

线程通信和进程通信区别（线程进程区别）

Matlab随机波动率SV、GARCH用MCMC马尔可夫链蒙特卡罗方法分析汇率时间序列

微信小程序前端解密获取用户信息

Spring MVC 自学杂记（五） -- SpringMVC与前台的json数据交互

《MySQL技术内幕：InnoDB存储引擎》笔记

扩容TIKV节点遇到的坑

PHP辅导代做编程：CS353 Database System

自学Zabbix3.10.2-事件通知Notifications upon events-Actions报警配置点击返回：自学zabbix集锦

HDU 5678 ztr loves trees

拓端tecdat|R语言弹性网络Elastic Net正则化惩罚回归模型交叉验证可视化

二叉树及其应用--二叉树创建

详解STM32单片机的堆栈