This article is shared from the Huawei Cloud community post "[2023 · CANN Training Camp Season 1] Ascend C sqrt operator in practice", author: dayao.

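Preface

This post writes an Ascend C sqrt operator and verifies it in CPU and NPU modes by invoking the kernel function directly. In the training-camp sandbox environment, CPU mode works and produces correct results.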

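I. Overview

First, a brief review of the flow and implementation of TIK C++ vector operator programming.

The vector operator development flow consists mainly of the following work:

1. Operator analysis: determine the inputs and outputs, the mathematical expression and the underlying implementation interface, and the kernel function definition.

2. Operator class implementation: implement Init() and Process(). Init() initializes memory; in substance it expresses multi-core execution, how data is split within a single core, and whether the double buffer optimization is enabled. Process() implements the three pipeline tasks CopyIn, Compute, and CopyOut.

3. Operator verification: call the operator through the kernel launch syntax, compute the result, and compare it with the result numpy computes from the same input; the operator passes if the error is within a given tolerance. In real applications, accuracy should be compared against the operator of the original framework.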
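The CopyIn, Compute, CopyOut paradigm from step 2 can be pictured with a toy model in plain Python. This is only a sketch of the data flow between the three stages: the two deques stand in for the VECIN and VECOUT queues, and the name run_pipeline is made up for illustration. It is not Ascend C code and it does not model double buffering or real parallelism.

```python
import math
from collections import deque


def run_pipeline(tiles):
    """Toy model of the CopyIn -> Compute -> CopyOut paradigm (not Ascend C)."""
    in_queue = deque()   # stands in for the VECIN queue
    out_queue = deque()  # stands in for the VECOUT queue
    results = []         # stands in for the output in global memory

    def copy_in(progress):
        # Stage 1: copy one tile into "local" memory and enqueue it.
        in_queue.append(list(tiles[progress]))

    def compute(progress):
        # Stage 2: dequeue the input tile, compute, enqueue the output tile.
        x_local = in_queue.popleft()
        out_queue.append([math.sqrt(v) for v in x_local])

    def copy_out(progress):
        # Stage 3: dequeue the output tile and write it back.
        results.append(out_queue.popleft())

    for i in range(len(tiles)):
        copy_in(i)
        compute(i)
        copy_out(i)
    return results


if __name__ == "__main__":
    print(run_pipeline([[1.0, 4.0], [9.0, 16.0]]))  # [[1.0, 2.0], [3.0, 4.0]]
```

Each stage only talks to its neighbour through a queue, which is what lets the real hardware overlap the stages for different tiles.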

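II. Operator analysis

The operator is defined as follows, again assuming 8 logical cores: as the code later in this post shows, it computes z = sqrt(x), where x and z are half tensors of shape (8, 2048).

According to the TIK C++ API reference, the computation can be done with Sqrt (TIK C++ API / Vector computation / Unary / Sqrt), using the level-2 interface, to obtain the final result.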
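As a reference for the mathematical expression, here is a minimal host-side sketch in plain Python of an element-wise square root whose inputs and outputs are rounded to half precision. The helper names to_half and sqrt_half are hypothetical, and this says nothing about how the hardware Sqrt instruction rounds internally.

```python
import math
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sqrt_half(x):
    """Element-wise reference: z[i] = sqrt(x[i]), with x and z rounded to half."""
    return [to_half(math.sqrt(to_half(v))) for v in x]


if __name__ == "__main__":
    print(sqrt_half([0.0, 1.0, 4.0, 2.25]))  # [0.0, 1.0, 2.0, 1.5]
```

The struct format code "e" is the standard library's half-precision float, so no third-party package is needed for a quick sanity check.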

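III. Code analysis

The changes are made directly on the add_tik2 operator project provided by the training-camp course. Code location: https://gitee.com/zgx950813/samples/tree/master/tik2_demo/kernel_samples/kernel_add_sample

The directory structure of the modified code is as follows: CMakeLists.txt and data_utils.h are unchanged, and in the build-and-run script run.sh only the part that compares the computed result against the ground truth was changed.

1) Kernel function definition

Compared with the sample, the only input parameter is x.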

extern "C" __global__ __aicore__ void sqrt_tik2(__gm__ uint8_t* x, __gm__ uint8_t* z)
{
    KernelSqrt op;
    op.Init(x, z);
    op.Process();
}

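2) Operator class

The implementation is similar to the add sample. Init() initializes memory: the Global Memory for x and z, and the queue memory used for communication between pipeline tasks. Process() runs the pipeline tasks, with CopyIn, Compute, and CopyOut written according to the paradigm. The biggest difference from the add sample is that Compute calls the level-2 Sqrt API to perform the computation.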

class KernelSqrt {
public:
    __aicore__ inline KernelSqrt() {}

    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* z)
    {
        // get start index for current core, core parallel
        xGm.SetGlobalBuffer((__gm__ half*)x + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half*)z + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        // pipe alloc memory to queue, the unit is Bytes
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }

    __aicore__ inline void Process()
    {
        // loop count need to be doubled, due to double buffer
        constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
        // tiling strategy, pipeline parallel
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        // alloc tensor from queue memory
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        // copy progress_th tile from global tensor to local tensor
        DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
        // enque input tensors to VECIN queue
        inQueueX.EnQue(xLocal);
    }

    __aicore__ inline void Compute(int32_t progress)
    {
        // deque input tensors from VECIN queue
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        // call Sqrt instr for computation
        Sqrt(zLocal, xLocal, TILE_LENGTH);
        // enque the output tensor to VECOUT queue
        outQueueZ.EnQue<half>(zLocal);
        // free input tensors for reuse
        inQueueX.FreeTensor(xLocal);
    }

    __aicore__ inline void CopyOut(int32_t progress)
    {
        // deque output tensor from VECOUT queue
        LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // copy progress_th tile from local tensor to global tensor
        DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
        // free output tensor for reuse
        outQueueZ.FreeTensor(zLocal);
    }

private:
    TPipe pipe;
    // create queues for input, in this case depth is equal to buffer num
    TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX;
    // create queue for output, in this case depth is equal to buffer num
    TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    GlobalTensor<half> xGm, zGm;
};
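The class relies on the constants BLOCK_LENGTH, TILE_NUM, BUFFER_NUM, and TILE_LENGTH, whose definitions are not shown in this post. The sketch below works through the index arithmetic in plain Python under assumed values: TOTAL_LENGTH and USE_CORE_NUM follow the 8 * 2048 elements and blockDim = 8 used in main(), while TILE_NUM = 8 and BUFFER_NUM = 2 are assumptions, as is the formula used for TILE_LENGTH. The function tile_ranges is a hypothetical helper that lists which global elements each loop iteration of one core touches.

```python
TOTAL_LENGTH = 8 * 2048  # total number of half elements (matches main())
USE_CORE_NUM = 8         # logical cores, i.e. blockDim
TILE_NUM = 8             # assumed: tiles per core before double buffering
BUFFER_NUM = 2           # assumed: 2 means double buffer

BLOCK_LENGTH = TOTAL_LENGTH // USE_CORE_NUM           # elements handled by one core
TILE_LENGTH = BLOCK_LENGTH // TILE_NUM // BUFFER_NUM  # assumed: elements per iteration
LOOP_COUNT = TILE_NUM * BUFFER_NUM                    # iterations in Process()


def tile_ranges(block_idx):
    """Return the (start, end) global element range of each iteration for one core."""
    base = block_idx * BLOCK_LENGTH  # same offset as in SetGlobalBuffer
    return [
        (base + i * TILE_LENGTH, base + (i + 1) * TILE_LENGTH)
        for i in range(LOOP_COUNT)
    ]


if __name__ == "__main__":
    print(BLOCK_LENGTH, TILE_LENGTH, LOOP_COUNT)  # 2048 128 16
    print(tile_ranges(1)[:2])                     # [(2048, 2176), (2176, 2304)]
```

Under these assumptions every element is visited exactly once: 8 cores times 16 iterations times 128 elements equals 16384.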

3) Kernel function invocation

1. In CPU mode, call through ICPU_RUN_KF

ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug

2. In NPU mode, call through <<<>>>

#ifndef __CCE_KT_TEST__
// call of kernel function
void sqrt_tik2_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* z)
{
    sqrt_tik2<<<blockDim, l2ctrl, stream>>>(x, z);
}
#endif

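Because <<<>>> can only be called in NPU mode, conditional compilation is needed so that this code is not compiled in CPU debug mode. Calling sqrt_tik2_do must follow the requirements of AscendCL application programming.

3. Calling code

The "__CCE_KT_TEST__" macro distinguishes CPU mode from NPU mode.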

int32_t main(int32_t argc, char* argv[])
{
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);  // uint16_t represent half
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t); // uint16_t represent half
    uint32_t blockDim = 8;

#ifdef __CCE_KT_TEST__
    uint8_t* x = (uint8_t*)tik2::GmAlloc(inputByteSize);
    uint8_t* z = (uint8_t*)tik2::GmAlloc(outputByteSize);

    ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
    // PrintData(x, 16, printDataType::HALF);

    ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug

    // PrintData(z, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", z, outputByteSize);

    tik2::GmFree((void *)x);
    tik2::GmFree((void *)z);
#else
    aclInit(nullptr);
    aclrtContext context;
    aclError error;
    int32_t deviceId = 0;
    aclrtCreateContext(&context, deviceId);
    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);

    uint8_t *xHost, *zHost;
    uint8_t *xDevice, *zDevice;
    aclrtMallocHost((void**)(&xHost), inputByteSize);
    aclrtMallocHost((void**)(&zHost), outputByteSize);
    aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);

    ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
    // PrintData(xHost, 16, printDataType::HALF);
    aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);

    sqrt_tik2_do(blockDim, nullptr, stream, xDevice, zDevice); // call kernel in this function
    aclrtSynchronizeStream(stream);

    aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST);
    // PrintData(zHost, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", zHost, outputByteSize);

    aclrtFree(xDevice);
    aclrtFree(zDevice);
    aclrtFreeHost(xHost);
    aclrtFreeHost(zHost);

    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
#endif
    return 0;
}

4) Reference data generation: sqrt_tik2.py

numpy is used to generate input_x and the reference result golden.

import numpy as np

def gen_golden_data_simple():
    input_x = np.random.uniform(0, 100, [8, 2048]).astype(np.float16)
    golden = np.sqrt(input_x).astype(np.float16)
    input_x.tofile("./input/input_x.bin")
    golden.tofile("./output/golden.bin")

if __name__ == "__main__":
    gen_golden_data_simple()
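The .bin files written by tofile() are raw element data with no header, so a file of 8 * 2048 half values is 8 * 2048 * 2 = 32768 bytes, the same size main() reads. As a minimal sketch, the standard library alone can write and read such files. The helper names below are hypothetical, and little-endian byte order is an assumption about the host.

```python
import os
import struct
import tempfile


def write_half_file(path, values):
    """Write floats as raw little-endian half-precision data (no header)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}e", *values))


def read_half_file(path):
    """Read a raw little-endian half-precision file back into a list of floats."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // 2  # each half value occupies 2 bytes
    return list(struct.unpack(f"<{count}e", data[: count * 2]))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        write_half_file(path, [1.0, 2.5, 100.0])
        print(os.path.getsize(path))  # 6
        print(read_half_file(path))   # [1.0, 2.5, 100.0]
```

This is handy for inspecting the first few values of input_x.bin or output_z.bin on a machine where numpy is not installed.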

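5) Comparing the results

numpy's allclose() function is used to compare the operator's output with the reference data. In practice, because compilation failed in NPU mode, this function was never actually run for the comparison. In CPU mode, the operator's output is identical to the golden reference data: the two files have the same md5.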
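The two checks mentioned here, a tolerance comparison and an md5 comparison, can be sketched with the standard library as follows. The allclose function below applies the same tolerance rule as numpy.allclose, |a - b| <= atol + rtol * |b|, but the default tolerances of 1e-3 are an assumption chosen for half precision, not values taken from run.sh, and same_md5 is a hypothetical helper.

```python
import hashlib


def allclose(actual, expected, rtol=1e-3, atol=1e-3):
    """Tolerance rule of numpy.allclose: |a - b| <= atol + rtol * |b| for every pair."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected))


def same_md5(a_bytes, b_bytes):
    """True when two byte strings have the same md5 digest (bit-identical files)."""
    return hashlib.md5(a_bytes).hexdigest() == hashlib.md5(b_bytes).hexdigest()


if __name__ == "__main__":
    print(allclose([1.0, 2.0], [1.0005, 2.0]))  # True
    print(allclose([1.0], [1.1]))               # False
    print(same_md5(b"\x00\x3c", b"\x00\x3c"))   # True
```

An md5 match is the stricter test: it only passes when the output is bit-for-bit identical to the reference, whereas the tolerance check accepts small rounding differences.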

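IV. Build and run

The course provides a sandbox environment for running the code, so the code first has to be brought into it.

1) Configure environment variables

2) CPU mode

CPU mode compiled and ran without problems, and the result is identical to the reference.

3) NPU mode

Compilation failed in NPU mode. Because sandbox time was limited, this is left for future investigation.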
