以cufftPlanMany為例FFT變換中embed,stride,dist的解釋與設定

關于FFT的自定義資料分布進行變換，之前每次都是用的寫demo，這次搞明白之後記錄一下，以便以後查閱。

比如需要對一個二維數組裡的每一行或者每一列進行傅裡葉變換，那麼需要對cufftPlanMany進行設定，然後進行批量處理。

cufftPlanMany的函數聲明如下

cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
int istride, int idist, int *onembed, int ostride,int odist, cufftType type, int batch);

比如對于一個10*5的二維數組，想對每一行進行FFT：

以cufftPlanMany為例FFT變換中embed,stride,dist的解釋與設定

NX=10；NY=5；

rank=1，表明是進行一維傅裡葉變換；

n[1]，數組n次元與rank相同，表明每一維的變換數目，本例中n隻有一維，n[0]=NX，表示一維傅裡葉變換需要10個元素；

inembed[2]，表明原始輸入資料的次元，本例中輸入資料是一個二維數組，是以inembed次元為2，inembed[0]=NX, inembed[1]=NY;

istride，表明在一個傅裡葉變換内部每一個元素相距的距離，本例中 istride=1；

idist，表明在兩個傅裡葉變換之間第一個元素相距的距離，本例中 idist=NX;

batch 表明多少個獨立的傅裡葉變換，本例中 batch=NY;

本例中onembed,ostride,odist做相同設定。

本例中我們進行C2C的傅裡葉變換，是以輸入和輸出格式一樣。實作代碼如下:

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>
#include "timer.h"

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10
#define NY 5


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftMakePlanMany(cufftHandle plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch, size_t *workSize);
	 */
	int rank=1;
	int n[1];
	n[0]=NX;
	int istride=1;
	int idist = NX;
	int ostride=1;
	int odist = NX;
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[0] = NY;

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NY);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
}

實際輸出如下：

45.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

145.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

245.000000 0.000000 
-4.999997 15.388416 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624597 
-5.000000 0.000000 
-5.000000 -1.624597 
-5.000000 -3.632713 
-5.000000 -6.881910 
-4.999997 -15.388416 

345.000000 0.000000 
-4.999998 15.388418 
-4.999999 6.881909 
-4.999999 3.632712 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-4.999999 -3.632712 
-4.999999 -6.881909 
-4.999998 -15.388418 

445.000000 0.000000 
-4.999999 15.388418 
-5.000001 6.881911 
-5.000000 3.632714 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632714 
-5.000001 -6.881911 
-4.999999 -15.388418

對于上述資料每一列進行一次FFT來說：

以cufftPlanMany為例FFT變換中embed,stride,dist的解釋與設定

NX=10；NY=5；

rank=1，表明是進行一維傅裡葉變換；

n[1]，數組n次元與rank相同，表明每一維的變換數目，本例中n隻有一維，n[0]=NY，表示一維傅裡葉變換需要5個元素；

inembed[2]，表明原始輸入資料的次元，本例中輸入資料是一個二維數組，是以inembed次元為2，inembed[0]=NX, inembed[1]=NY;

istride，表明在一個傅裡葉變換内部每一個元素相距的距離，本例中 istride= NX；

idist，表明在兩個傅裡葉變換之間第一個元素相距的距離，本例中 idist=1; （因為兩列傅裡葉變換第一個元素相鄰）

batch 表明多少個獨立的傅裡葉變換，本例中 batch=NX;

代碼如下：

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>
#include "timer.h"

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10
#define NY 5


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftMakePlanMany(cufftHandle plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch, size_t *workSize);
	 */
	int rank=1;
	int n[1];
	n[0]=NY;
	int istride=NX;
	int idist = 1;
	int ostride=NX;
	int odist = 1;
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[0] = NY;

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NX);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
}

結果如下：

100.000000 0.000000 
105.000000 0.000000 
110.000000 0.000000 
115.000000 0.000000 
120.000000 0.000000 
125.000000 0.000000 
130.000000 0.000000 
135.000000 0.000000 
140.000000 0.000000 
145.000000 0.000000 

-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 

-25.000002 8.122991 
-25.000002 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 

-25.000002 -8.122991 
-25.000002 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 

-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550

另外文章參考了這些資源：

https://docs.nvidia.com/cuda/cufft/index.html#data-layout

https://rocfft.readthedocs.io/en/latest/real.html#setting-strides

更新：感謝評論中meng_zhi_xiang的提醒，關于inembed和onembed，針對我這種特殊情況（按行按列進行FFT）可以直接寫成資料大小。但是實際上如果資料視圖定義不僅僅如此簡單，inembed和onembed還需要重新設定。這段話的分析與解釋可以參考meng_zhi_xiang的第二段評論。

The inembed and onembed parameters define the number of elements in each dimension in the input array and the output array respectively. The inembed[rank-1] contains the number of elements in the least significant (innermost) dimension of the input data excluding the istride elements; the number of total elements in the least significant dimension of the input array is then istride*inembed[rank-1]. The inembed[0] or onembed[0] corresponds to the most significant (that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides this information instead. Note that the size of each dimension of the transform should be less than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤ inembed[i], n[i] ≤ onembed[i], where i ∈ { 0, … , rank − 1}.

另外官方提供了一個更複雜的stride與inembed應用的例子：

https://docs.nvidia.com/cuda/cufft/index.html#twod-advanced-data-layout-use

以cufftPlanMany為例FFT變換中embed,stride,dist的解釋與設定

繼續閱讀

DAT（NIPS 2018）視訊目标跟蹤源碼運作筆記1. 論文基本資訊2. 運作環境介紹3. 準備4. 配置5. 運作6. 可能出現的問題及解決方法

Ubuntu16.04+Pytorch1.4.0+cuda10.0的pip安裝

【CUDA-C/C++】任意次元矩陣乘

WSL2 的docker裡使用顯示卡的安裝

關于安裝pytorch的一些問題總結

jetson nano ubuntu 安裝opencv4 cuda10 pytorch

Jetson Nano Ubuntu編譯OpenCV4.4.0+opencv_contrib(帶CUDA)

大學、碩士、博士的差別是什麼？

yolov7 tensorrt模型加速部署【實戰】

STFT filter bankSTFT filter bank

linxu下CUDA靜态庫-上

相位相關算法

《cuda并行程式設計》勘誤（3）

《cuda并行程式設計》勘誤（2）

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory（完美解決）

一種解決思路： ImportError: libcublas.so.10.0: cannot open shared object file: No such file