【CUDA開發】 CUDA Thrust 規約求和

1. 使用 Thrust

Thrust 是一個開源的 C++ 庫，用于開發高性能并行應用程式，以 C++ 标準模闆庫為藍本實作。

官方文檔見這裡：CUDA Thrust

/* ... */

float *fMatrix_Device; // 指向裝置顯存

int iMatrixSize = iRow * iCol; // 矩陣元素個數

cudaMalloc((void**)&fMatrix_Device, iMatrixSize * sizeof(float)); // 在顯存中為矩陣開辟空間

cudaMemcpy(fMatrix_Device, fMatrix_Host, iMatrixSize * sizeof(float), cudaMemcpyHostToDevice); // 将資料拷貝到顯存

thrust::device_ptr<float> dev_ptr(fMatrix_Device);

float thrustResult = thrust::reduce(dev_ptr, dev_ptr + size_t(iMatrixSize), (float)0, thrust::plus<float>());

其中，fMatrix_Host 為指向主機記憶體的矩陣的頭指針。

2. 我的 Reduction

/**

* 每個 warp 自動同步，不用 syncthreads();

* volatile : 加上關鍵字volatile的變量将被定義為敏感變量，意思是加了volatile

* 的變量在記憶體中的值可能會随時發生變化，當程式要去讀取這個變量時，

必須要從記憶體中讀取，而不是從緩存中讀取

* sdata 數組頭指針，數組位于共享記憶體

* tid 線程索引

/

device void warpReduce(volatile float sdata, int tid)

{

sdata[tid] += sdata[tid + 32];

sdata[tid] += sdata[tid + 16];

sdata[tid] += sdata[tid + 8];

sdata[tid] += sdata[tid + 4];

sdata[tid] += sdata[tid + 2];

sdata[tid] += sdata[tid + 1];

}

* 優化：解決了 reduce3 中存在的多餘同步操作（每個warp預設自動同步）。

* globalInputData 輸入資料，位于全局記憶體

* globalOutputData 輸出資料，位于全局記憶體

global void reduce4(float globalInputData, float globalOutputData, unsigned int n)

shared__ float sdata[BLOCK_SIZE];

// 坐标索引

unsigned int tid = threadIdx.x;

unsigned int index = blockIdx.x(blockDim.x 2) + threadIdx.x;

unsigned int indexWithOffset = index + blockDim.x;

if (index >= n) sdata[tid] = 0;

else if (indexWithOffset >= n) sdata[tid] = globalInputData[index];

else sdata[tid] = globalInputData[index] + globalInputData[indexWithOffset];

syncthreads();

// 在共享記憶體中對每一個塊進行規約計算

for (unsigned int s = blockDim.x / 2; s>32; s >>= 1)

{

if (tid < s) sdata[tid] += sdata[tid + s];

syncthreads();

}

if (tid < 32) warpReduce(sdata, tid);

// 把計算結果從共享記憶體寫回全局記憶體

if (tid == 0) globalOutputData[blockIdx.x] = sdata[0];

* 計算 reduce4 函數的時間

* fMatrix_Host 矩陣頭指針

* iRow 矩陣行數

* iCol 矩陣列數

* @return 和

float RuntimeOfReduce4(float fMatrix_Host, const int iRow, const int iCol)

float fReuslt = (float)malloc(sizeof(float));;

float fMatrix_Device; // 指向裝置顯存

int iMatrixSize = iRow * iCol; // 矩陣元素個數

cudaMalloc((void**)&fMatrix_Device, iMatrixSize * sizeof(float)); // 在顯存中為矩陣開辟空間

cudaMemcpy(fMatrix_Device, fMatrix_Host, iMatrixSize * sizeof(float), cudaMemcpyHostToDevice); // 将資料拷貝到顯存

/* ... /

for (int i = 1, int iNum = iMatrixSize; i < iMatrixSize; i = 2 i * BLOCK_SIZE)

int iBlockNum = (iNum + (2 * BLOCK_SIZE) - 1) / (2 * BLOCK_SIZE);

reduce4<<<iBlockNum, BLOCK_SIZE>>>(fMatrix_Device, fMatrix_Device, iNum);

iNum = iBlockNum;

cudaMemcpy(fReuslt, fMatrix_Device, sizeof(float), cudaMemcpyDeviceToHost); // 将資料拷貝到記憶體

cudaFree(fMatrix_Device);// 釋放顯存空間

return fReuslt[0];

上述程式是優化的最終版本，優化的主要内容包括：

1. 避免每個 Warp 中出現分支導緻效率低下。

2. 減少取餘操作。

3. 減小不必要的同步操作，每個warp都是預設同步的，不用額外的同步操作。

4. 減小線程的閑置，提高并行度

3. 時間對比

資料的大小為：

iRow = 1000;

iCol = 1000;

時間為：

ReduceThrust 的運作時間為：0.179968ms.

494497

Reduce0 的運作時間為：0.229152ms.

Reduce1 的運作時間為：0.134816ms.

Reduce2 的運作時間為：0.117504ms.

Reduce3 的運作時間為：0.086016ms.

Reduce4 的運作時間為：0.07424ms.

CPU的運作時間為：1 ms.

iRow = 2000;

iCol = 2000;

ReduceThrust 的運作時間為：0.282944ms.

1.97828e+006

Reduce0 的運作時間為：0.779776ms.

Reduce1 的運作時間為：0.42624ms.

Reduce2 的運作時間為：0.343744ms.

Reduce3 的運作時間為：0.217248ms.

Reduce4 的運作時間為：0.160416ms.

CPU的運作時間為：3 ms.

iRow = 4000;

iCol = 4000;

ReduceThrust 的運作時間為：0.536832ms.

7.91319e+006

Reduce0 的運作時間為：2.9919ms.

Reduce1 的運作時間為：1.56054ms.

Reduce2 的運作時間為：1.26618ms.

Reduce3 的運作時間為：0.726016ms.

Reduce4 的運作時間為：0.531712ms.

CPU的運作時間為：11 ms.

iRow = 6000;

iCol = 6000;

ReduceThrust 的運作時間為：0.988992ms.

1.7807e+007

Reduce4 的運作時間為：1.09286ms.

CPU的運作時間為：25 ms.

iRow = 11000;

iCol = 11000;

ReduceThrust 的運作時間為：2.9208ms.

5.98583e+007

Reduce4 的運作時間為：3.36998ms.

CPU的運作時間為：85 ms.

【CUDA開發】 CUDA Thrust 規約求和

1. 使用 Thrust

2. 我的 Reduction

上述程式是優化的最終版本，優化的主要内容包括：

3. 時間對比

繼續閱讀

5G小型蜂應用指南

UVA 116 Unidirectional TSP

PAT (Advanced Level) Practise 1012 The Best Rank (25)

mysql5.7的sql優化

線程通信和程序通信差別（線程程序差別）

Matlab随機波動率SV、GARCH用MCMC馬爾可夫鍊蒙特卡羅方法分析匯率時間序列

微信小程式前端解密擷取使用者資訊

Spring MVC 自學雜記（五） -- SpringMVC與前台的json資料互動

《MySQL技術内幕：InnoDB存儲引擎》筆記

擴容TIKV節點遇到的坑

PHP輔導代做程式設計：CS353 Database System

自學Zabbix3.10.2-事件通知Notifications upon events-Actions報警配置點選傳回：自學zabbix集錦

HDU 5678 ztr loves trees

拓端tecdat|R語言彈性網絡Elastic Net正則化懲罰回歸模型交叉驗證可視化

二叉樹及其應用--二叉樹建立

詳解STM32單片機的堆棧