Model Deployment Architecture Types
Deploying algorithm models falls broadly into two categories. One is deployment on the mobile/edge side, i.e., embedded deployment, usually delivered as an SDK. The other is the cloud/server side, usually delivered as a service. This post focuses on the deployment workflow; mobile deployment, deployment on specific vendors' intelligent hardware, and cloud servers will each get dedicated follow-up posts.
Edge Side
Model training: train the algorithm model with a deep learning framework such as PyTorch or TensorFlow to obtain the model weight file. Training itself is not the focus today; later posts will cover training tricks, model tuning, pruning, distillation, and quantization.
Model conversion: convert the weight file into the format required by the target intelligent hardware, so that the corresponding GPU, NPU, or IPU acceleration unit can be used to speed up inference.
Algorithm deployment: reimplement the original model's inference logic on the embedded side.
Model Conversion
Vendors including NVIDIA, Qualcomm, Huawei, and AMD have all invested in neural network acceleration. Model size is reduced through quantization, pruning, and compression, and faster inference is achieved by running at reduced precision on efficient compute platforms, including Intel MKL-DNN, ARM CMSIS, Qualcomm SNPE, NVIDIA TensorRT, HiSilicon, RockChip RKNN, SigmaStar SGS_IPU, and others.
TensorRT is used as the example here; detailed step-by-step deployment tutorials for the other platforms will follow in later posts.
TensorRT
Method 1: first convert the trained weight file (e.g., .pt or .pb) to ONNX, run onnx-simplifier to optimize the graph and obtain a clean, readable model graph, and finally convert it to the corresponding engine file with trtexec.
Taking YOLOv5 as an example, the ONNX export code:
# YOLOv5 ONNX export
try:
    check_requirements(('onnx',))
    import onnx

    LOGGER.info(f'\n{prefix} starting export with onnx {onnx.__version__}...')
    f = file.with_suffix('.onnx')
    torch.onnx.export(model, im, f, verbose=False, opset_version=opset,
                      training=torch.onnx.TrainingMode.TRAINING if train else torch.onnx.TrainingMode.EVAL,
                      do_constant_folding=not train,
                      input_names=['images'],
                      output_names=['output'],
                      dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # shape(1,3,640,640)
                                    'output': {0: 'batch', 1: 'anchors'}  # shape(1,25200,85)
                                    } if dynamic else None)

    # Checks
    model_onnx = onnx.load(f)  # load onnx model
    onnx.checker.check_model(model_onnx)  # check onnx model
    # LOGGER.info(onnx.helper.printable_graph(model_onnx.graph))  # print

    # Simplify
    if simplify:
        try:
            check_requirements(('onnx-simplifier',))
            import onnxsim

            LOGGER.info(f'{prefix} simplifying with onnx-simplifier {onnxsim.__version__}...')
            model_onnx, check = onnxsim.simplify(
                model_onnx,
                dynamic_input_shape=dynamic,
                input_shapes={'images': list(im.shape)} if dynamic else None)
            assert check, 'assert check failed'
            onnx.save(model_onnx, f)
        except Exception as e:
            LOGGER.info(f'{prefix} simplifier failure: {e}')
    LOGGER.info(f'{prefix} export success, saved as {f} ({file_size(f):.1f} MB)')
    return f
except Exception as e:
    LOGGER.info(f'{prefix} export failure: {e}')
The exported ONNX model graph:
(Figure: YOLOv5 ONNX structure diagram)
Then run:
trtexec --onnx=weights/yolov5s.onnx --saveEngine=weights/yolov5s.engine
For YOLOv5, the official repository already provides a one-click script for exporting to various formats; refer to it for details.
Here we only outline the general approach to model conversion.
Method 2: following the official TensorRT API documentation, build the model structure by hand, then use the API to convert the model into an engine file.
Again taking YOLOv5 as the example:
Extract the model weights
import sys
import argparse
import os
import struct
import torch
from utils.torch_utils import select_device


def parse_args():
    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')
    parser.add_argument('-w', '--weights', required=True,
                        help='Input weights (.pt) file path (required)')
    parser.add_argument(
        '-o', '--output', help='Output (.wts) file path (optional)')
    parser.add_argument(
        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg'],
        help='determines the model is detection/classification')
    args = parser.parse_args()
    if not os.path.isfile(args.weights):
        raise SystemExit('Invalid input file')
    if not args.output:
        args.output = os.path.splitext(args.weights)[0] + '.wts'
    elif os.path.isdir(args.output):
        args.output = os.path.join(
            args.output,
            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')
    return args.weights, args.output, args.type


pt_file, wts_file, m_type = parse_args()
print(f'Generating .wts for {m_type} model')

# Load model
print(f'Loading {pt_file}')
device = select_device('cpu')
model = torch.load(pt_file, map_location=device)  # Load FP32 weights
model = model['ema' if model.get('ema') else 'model'].float()

if m_type in ['detect', 'seg']:
    # update anchor_grid info
    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]
    # model.model[-1].anchor_grid = anchor_grid
    delattr(model.model[-1], 'anchor_grid')  # model.model[-1] is detect layer
    # The parameters are saved in the OrderDict through the "register_buffer" method, and then saved to the weight.
    model.model[-1].register_buffer("anchor_grid", anchor_grid)
    model.model[-1].register_buffer("strides", model.model[-1].stride)

model.to(device).eval()

print(f'Writing into {wts_file}')
with open(wts_file, 'w') as f:
    f.write('{}\n'.format(len(model.state_dict().keys())))
    for k, v in model.state_dict().items():
        vr = v.reshape(-1).cpu().numpy()
        f.write('{} {} '.format(k, len(vr)))
        for vv in vr:
            f.write(' ')
            f.write(struct.pack('>f', float(vv)).hex())
        f.write('\n')
Build and compile the YOLOv5 model structure with the TensorRT API
Core code block:
ICudaEngine* build_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {
INetworkDefinition* network = builder->createNetworkV2(0U);
// Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });
assert(data);
std::map<std::string, Weights> weightMap = loadWeights(wts_name);
/* ------ yolov5 backbone------ */
auto conv0 = convBlock(network, weightMap, *data, get_width(64, gw), 6, 2, 1, "model.0");
assert(conv0);
auto conv1 = convBlock(network, weightMap, *conv0->getOutput(0), get_width(128, gw), 3, 2, 1, "model.1");
auto bottleneck_CSP2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, "model.2");
auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), get_width(256, gw), 3, 2, 1, "model.3");
auto bottleneck_csp4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(6, gd), true, 1, 0.5, "model.4");
auto conv5 = convBlock(network, weightMap, *bottleneck_csp4->getOutput(0), get_width(512, gw), 3, 2, 1, "model.5");
auto bottleneck_csp6 = C3(network, weightMap, *conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, "model.6");
auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), get_width(1024, gw), 3, 2, 1, "model.7");
auto bottleneck_csp8 = C3(network, weightMap, *conv7->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), true, 1, 0.5, "model.8");
auto spp9 = SPPF(network, weightMap, *bottleneck_csp8->getOutput(0), get_width(1024, gw), get_width(1024, gw), 5, "model.9");
/* ------ yolov5 head ------ */
auto conv10 = convBlock(network, weightMap, *spp9->getOutput(0), get_width(512, gw), 1, 1, 1, "model.10");
auto upsample11 = network->addResize(*conv10->getOutput(0));
assert(upsample11);
upsample11->setResizeMode(ResizeMode::kNEAREST);
upsample11->setOutputDimensions(bottleneck_csp6->getOutput(0)->getDimensions());
ITensor* inputTensors12[] = { upsample11->getOutput(0), bottleneck_csp6->getOutput(0) };
auto cat12 = network->addConcatenation(inputTensors12, 2);
auto bottleneck_csp13 = C3(network, weightMap, *cat12->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.13");
auto conv14 = convBlock(network, weightMap, *bottleneck_csp13->getOutput(0), get_width(256, gw), 1, 1, 1, "model.14");
auto upsample15 = network->addResize(*conv14->getOutput(0));
assert(upsample15);
upsample15->setResizeMode(ResizeMode::kNEAREST);
upsample15->setOutputDimensions(bottleneck_csp4->getOutput(0)->getDimensions());
ITensor* inputTensors16[] = { upsample15->getOutput(0), bottleneck_csp4->getOutput(0) };
auto cat16 = network->addConcatenation(inputTensors16, 2);
auto bottleneck_csp17 = C3(network, weightMap, *cat16->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, "model.17");
/* ------ detect ------ */
IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.0.weight"], weightMap["model.24.m.0.bias"]);
auto conv18 = convBlock(network, weightMap, *bottleneck_csp17->getOutput(0), get_width(256, gw), 3, 2, 1, "model.18");
ITensor* inputTensors19[] = { conv18->getOutput(0), conv14->getOutput(0) };
auto cat19 = network->addConcatenation(inputTensors19, 2);
auto bottleneck_csp20 = C3(network, weightMap, *cat19->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, "model.20");
IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.1.weight"], weightMap["model.24.m.1.bias"]);
auto conv21 = convBlock(network, weightMap, *bottleneck_csp20->getOutput(0), get_width(512, gw), 3, 2, 1, "model.21");
ITensor* inputTensors22[] = { conv21->getOutput(0), conv10->getOutput(0) };
auto cat22 = network->addConcatenation(inputTensors22, 2);
auto bottleneck_csp23 = C3(network, weightMap, *cat22->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, "model.23");
IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap["model.24.m.2.weight"], weightMap["model.24.m.2.bias"]);
auto yolo = addYoLoLayer(network, weightMap, "model.24", std::vector<IConvolutionLayer*>{det0, det1, det2});
yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*yolo->getOutput(0));
// Build engine
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(16 * (1 << 20)); // 16MB
#if defined(USE_FP16)
config->setFlag(BuilderFlag::kFP16);
#elif defined(USE_INT8)
std::cout << "Your platform support int8: " << (builder->platformHasFastInt8() ? "true" : "false") << std::endl;
assert(builder->platformHasFastInt8());
config->setFlag(BuilderFlag::kINT8);
Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, "./coco_calib/", "int8calib.table", INPUT_BLOB_NAME);
config->setInt8Calibrator(calibrator);
#endif
std::cout << "Building engine, please wait for a while..." << std::endl;
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
std::cout << "Build engine successfully!" << std::endl;
// Don't need the network any more
network->destroy();
// Release host memory
for (auto& mem : weightMap) {
free((void*)(mem.second.values));
}
return engine;
}
See the full implementation for details.
Method 3: as in Method 1, first convert the model to an ONNX graph, then use the TensorRT ONNX parser (onnx_parser) to build the engine file.
Core code block:
bool compile(
Mode mode,
unsigned int maxBatchSize,
const ModelSource& source,
const CompileOutput& saveto,
std::vector<InputDims> inputsDimsSetup,
Int8Process int8process,
const std::string& int8ImageDirectory,
const std::string& int8EntropyCalibratorFile,
const size_t maxWorkspaceSize) {
if (mode == Mode::INT8 && int8process == nullptr) {
INFOE("int8process must not nullptr, when in int8 mode.");
return false;
}
bool hasEntropyCalibrator = false;
vector<uint8_t> entropyCalibratorData;
vector<string> entropyCalibratorFiles;
if (mode == Mode::INT8) {
if (!int8EntropyCalibratorFile.empty()) {
if (iLogger::exists(int8EntropyCalibratorFile)) {
entropyCalibratorData = iLogger::load_file(int8EntropyCalibratorFile);
if (entropyCalibratorData.empty()) {
INFOE("entropyCalibratorFile is set as: %s, but we read is empty.", int8EntropyCalibratorFile.c_str());
return false;
}
hasEntropyCalibrator = true;
}
}
if (hasEntropyCalibrator) {
if (!int8ImageDirectory.empty()) {
INFOW("imageDirectory is ignore, when entropyCalibratorFile is set");
}
}
else {
if (int8process == nullptr) {
INFOE("int8process must be set. when Mode is '%s'", mode_string(mode));
return false;
}
entropyCalibratorFiles = iLogger::find_files(int8ImageDirectory, "*.jpg;*.png;*.bmp;*.jpeg;*.tiff");
if (entropyCalibratorFiles.empty()) {
INFOE("Can not find any images(jpg/png/bmp/jpeg/tiff) from directory: %s", int8ImageDirectory.c_str());
return false;
}
if(entropyCalibratorFiles.size() < maxBatchSize){
INFOW("Too few images provided, %d[provided] < %d[max batch size], image copy will be performed", entropyCalibratorFiles.size(), maxBatchSize);
int old_size = entropyCalibratorFiles.size();
for(int i = old_size; i < maxBatchSize; ++i)
entropyCalibratorFiles.push_back(entropyCalibratorFiles[i % old_size]);
}
}
}
else {
if (hasEntropyCalibrator) {
INFOW("int8EntropyCalibratorFile is ignore, when Mode is '%s'", mode_string(mode));
}
}
INFO("Compile %s %s.", mode_string(mode), source.descript().c_str());
shared_ptr<IBuilder> builder(createInferBuilder(gLogger), destroy_nvidia_pointer<IBuilder>);
if (builder == nullptr) {
INFOE("Can not create builder.");
return false;
}
shared_ptr<IBuilderConfig> config(builder->createBuilderConfig(), destroy_nvidia_pointer<IBuilderConfig>);
if (mode == Mode::FP16) {
if (!builder->platformHasFastFp16()) {
INFOW("Platform not have fast fp16 support");
}
config->setFlag(BuilderFlag::kFP16);
}
else if (mode == Mode::INT8) {
if (!builder->platformHasFastInt8()) {
INFOW("Platform not have fast int8 support");
}
config->setFlag(BuilderFlag::kINT8);
}
shared_ptr<INetworkDefinition> network;
//shared_ptr<ICaffeParser> caffeParser;
shared_ptr<nvonnxparser::IParser> onnxParser;
if(source.type() == ModelSourceType::OnnX || source.type() == ModelSourceType::OnnXData){
const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
network = shared_ptr<INetworkDefinition>(builder->createNetworkV2(explicitBatch), destroy_nvidia_pointer<INetworkDefinition>);
vector<nvinfer1::Dims> dims_setup(inputsDimsSetup.size());
for(int i = 0; i < inputsDimsSetup.size(); ++i){
auto s = inputsDimsSetup[i];
dims_setup[i] = convert_to_trt_dims(s.dims());
dims_setup[i].d[0] = -1;
}
//from onnx is not markOutput
onnxParser.reset(nvonnxparser::createParser(*network, gLogger, dims_setup), destroy_nvidia_pointer<nvonnxparser::IParser>);
if (onnxParser == nullptr) {
INFOE("Can not create parser.");
return false;
}
if(source.type() == ModelSourceType::OnnX){
if (!onnxParser->parseFromFile(source.onnxmodel().c_str(), 1)) {
INFOE("Can not parse OnnX file: %s", source.onnxmodel().c_str());
return false;
}
}else{
if (!onnxParser->parseFromData(source.onnx_data(), source.onnx_data_size(), 1)) {
INFOE("Can not parse OnnX file: %s", source.onnxmodel().c_str());
return false;
}
}
}
else {
INFOE("not implementation source type: %d", source.type());
Assert(false);
}
set_layer_hook_reshape(nullptr);
auto inputTensor = network->getInput(0);
auto inputDims = inputTensor->getDimensions();
shared_ptr<Int8EntropyCalibrator> int8Calibrator;
if (mode == Mode::INT8) {
auto calibratorDims = inputDims;
calibratorDims.d[0] = maxBatchSize;
if (hasEntropyCalibrator) {
INFO("Using exist entropy calibrator data[%d bytes]: %s", entropyCalibratorData.size(), int8EntropyCalibratorFile.c_str());
int8Calibrator.reset(new Int8EntropyCalibrator(
entropyCalibratorData, calibratorDims, int8process
));
}
else {
INFO("Using image list[%d files]: %s", entropyCalibratorFiles.size(), int8ImageDirectory.c_str());
int8Calibrator.reset(new Int8EntropyCalibrator(
entropyCalibratorFiles, calibratorDims, int8process
));
}
config->setInt8Calibrator(int8Calibrator.get());
}
INFO("Input shape is %s", join_dims(vector<int>(inputDims.d, inputDims.d + inputDims.nbDims)).c_str());
INFO("Set max batch size = %d", maxBatchSize);
INFO("Set max workspace size = %.2f MB", maxWorkspaceSize / 1024.0f / 1024.0f);
INFO("Base device: %s", CUDATools::device_description().c_str());
int net_num_input = network->getNbInputs();
INFO("Network has %d inputs:", net_num_input);
vector<string> input_names(net_num_input);
for(int i = 0; i < net_num_input; ++i){
auto tensor = network->getInput(i);
auto dims = tensor->getDimensions();
auto dims_str = join_dims(vector<int>(dims.d, dims.d+dims.nbDims));
INFO(" %d.[%s] shape is %s", i, tensor->getName(), dims_str.c_str());
input_names[i] = tensor->getName();
}
int net_num_output = network->getNbOutputs();
INFO("Network has %d outputs:", net_num_output);
for(int i = 0; i < net_num_output; ++i){
auto tensor = network->getOutput(i);
auto dims = tensor->getDimensions();
auto dims_str = join_dims(vector<int>(dims.d, dims.d+dims.nbDims));
INFO(" %d.[%s] shape is %s", i, tensor->getName(), dims_str.c_str());
}
int net_num_layers = network->getNbLayers();
INFO("Network has %d layers:", net_num_layers);
for(int i = 0; i < net_num_layers; ++i){
auto layer = network->getLayer(i);
auto name = layer->getName();
auto type_str = layer_type_name(layer);
auto input0 = layer->getInput(0);
if(input0 == nullptr) continue;
auto output0 = layer->getOutput(0);
auto input_dims = input0->getDimensions();
auto output_dims = output0->getDimensions();
bool has_input = layer_has_input_tensor(layer);
bool has_output = layer_has_output_tensor(layer);
auto descript = layer_descript(layer);
type_str = iLogger::align_blank(type_str, 18);
auto input_dims_str = iLogger::align_blank(dims_str(input_dims), 18);
auto output_dims_str = iLogger::align_blank(dims_str(output_dims), 18);
auto number_str = iLogger::align_blank(format("%d.", i), 4);
const char* token = " ";
if(has_input)
token = " >>> ";
else if(has_output)
token = " *** ";
INFOV("%s%s%s %s-> %s%s", token,
number_str.c_str(),
type_str.c_str(),
input_dims_str.c_str(),
output_dims_str.c_str(),
descript.c_str()
);
}
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(maxWorkspaceSize);
auto profile = builder->createOptimizationProfile();
for(int i = 0; i < net_num_input; ++i){
auto input = network->getInput(i);
auto input_dims = input->getDimensions();
input_dims.d[0] = 1;
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, input_dims);
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, input_dims);
input_dims.d[0] = maxBatchSize;
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, input_dims);
}
// not need
// for(int i = 0; i < net_num_output; ++i){
// auto output = network->getOutput(i);
// auto output_dims = output->getDimensions();
// output_dims.d[0] = 1;
// profile->setDimensions(output->getName(), nvinfer1::OptProfileSelector::kMIN, output_dims);
// profile->setDimensions(output->getName(), nvinfer1::OptProfileSelector::kOPT, output_dims);
// output_dims.d[0] = maxBatchSize;
// profile->setDimensions(output->getName(), nvinfer1::OptProfileSelector::kMAX, output_dims);
// }
config->addOptimizationProfile(profile);
// error on jetson
// auto timing_cache = shared_ptr<nvinfer1::ITimingCache>(config->createTimingCache(nullptr, 0), [](nvinfer1::ITimingCache* ptr){ptr->reset();});
// config->setTimingCache(*timing_cache, false);
// config->setFlag(BuilderFlag::kGPU_FALLBACK);
// config->setDefaultDeviceType(DeviceType::kDLA);
// config->setDLACore(0);
INFO("Building engine...");
auto time_start = iLogger::timestamp_now();
shared_ptr<ICudaEngine> engine(builder->buildEngineWithConfig(*network, *config), destroy_nvidia_pointer<ICudaEngine>);
if (engine == nullptr) {
INFOE("engine is nullptr");
return false;
}
if (mode == Mode::INT8) {
if (!hasEntropyCalibrator) {
if (!int8EntropyCalibratorFile.empty()) {
INFO("Save calibrator to: %s", int8EntropyCalibratorFile.c_str());
iLogger::save_file(int8EntropyCalibratorFile, int8Calibrator->getEntropyCalibratorData());
}
else {
INFO("No set entropyCalibratorFile, and entropyCalibrator will not save.");
}
}
}
INFO("Build done %lld ms !", iLogger::timestamp_now() - time_start);
// serialize the engine, then close everything down
shared_ptr<IHostMemory> seridata(engine->serialize(), destroy_nvidia_pointer<IHostMemory>);
if(saveto.type() == CompileOutputType::File){
return iLogger::save_file(saveto.file(), seridata->data(), seridata->size());
}else{
((CompileOutput&)saveto).set_data(vector<uint8_t>((uint8_t*)seridata->data(), (uint8_t*)seridata->data()+seridata->size()));
return true;
}
}
}; //namespace TRTBuilder
See the full implementation for details.
That completes the model conversion stage.
Pros and cons of the three methods:
Methods 1 and 3 are simpler, more convenient, and faster than Method 2; Method 1 in particular converts a model with essentially zero code. Method 2, by contrast, requires a clear understanding of the model structure and of the API operators, and the engine has to be built with hand-written code. However, for some models such as Transformer/ViT architectures, Methods 1 and 3 cannot do a one-click conversion because some operators are not yet supported, whereas Method 2 can still handle them. Overall, Method 2 is the most flexible of the three, but also the hardest to get started with.
Algorithm Deployment
Overall pipeline
(Figure: overall pipeline diagram)
Input: for real-time video streams such as RTSP, WebRTC, and RTMP, we first need to decode the stream to obtain RGB images (YUV420 / NV12 / NV21 -> RGB). Decoding can be done in software or in hardware: software decoding with codec libraries such as libx264/libx265, and hardware decoding with NVIDIA CUVID, HiSilicon, RockChip MPP, and the like. Video stream encoding/decoding will be covered in detail in a later dedicated post; a minimal software-decode sketch follows.
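As a minimal illustration (not the hardware-accelerated path described above), frames can be pulled from an RTSP stream and converted to RGB with OpenCV's FFmpeg backend; the stream URL is a placeholder:

# Minimal sketch: software-decode an RTSP stream into RGB frames with OpenCV/FFmpeg.
# The URL is a placeholder; hardware decoding (CUVID/MPP/HiSilicon) is not shown here.
import cv2

cap = cv2.VideoCapture("rtsp://example.com/stream")  # placeholder URL
while cap.isOpened():
    ok, frame_bgr = cap.read()  # OpenCV returns BGR frames
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # convert to RGB for the model
    # ... hand frame_rgb to the preprocessing step below ...
cap.release()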
Preprocessing: apply the same preprocessing to the RGB image as was used during training. For example, YOLOv5 needs adaptive resizing and normalization, while the SCRFD face detector needs adaptive resizing, mean subtraction of 127.5, and division by a standard deviation of 128. Adaptive resizing can be implemented with an affine transform or with letterbox padding; for the mean/std step, NVIDIA platforms can run it in CUDA for extra speed. A minimal letterbox-plus-normalization sketch is shown below.
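A minimal sketch of letterbox resizing plus normalization for a YOLOv5-style model, assuming a 640x640 input and a gray padding value of 114 (values are illustrative):

# Letterbox resize + normalize: HWC uint8 RGB -> NCHW float32 blob in [0, 1].
import cv2
import numpy as np

def letterbox_preprocess(img_rgb, new_shape=640, pad_value=114):
    h, w = img_rgb.shape[:2]
    r = min(new_shape / h, new_shape / w)               # scale ratio, keep aspect
    nh, nw = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img_rgb, (nw, nh), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((new_shape, new_shape, 3), pad_value, dtype=np.uint8)
    top, left = (new_shape - nh) // 2, (new_shape - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized       # paste into padded canvas
    blob = canvas.astype(np.float32) / 255.0             # normalize to [0, 1]
    blob = blob.transpose(2, 0, 1)[None]                 # HWC -> NCHW, add batch dim
    return np.ascontiguousarray(blob), r, (left, top)    # ratio/offsets for mapping boxes back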
Model inference: feed the image data from the two steps above into the deserialized engine, run the forward pass (model_forward), and obtain the output tensor. A rough sketch of this step with the TensorRT Python API is shown below.
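A minimal sketch, assuming the TensorRT 8.x Python API with pycuda, a fixed-shape engine, and the binding names "images"/"output" from the ONNX export above; this is illustrative, not the exact SDK code:

# Run a serialized TensorRT engine from Python (TensorRT 8.x + pycuda assumed).
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def infer(engine_path, blob):  # blob: float32 NCHW array from preprocessing
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    bindings, outputs = [], []
    for i in range(engine.num_bindings):                 # fixed-shape engine assumed
        dtype = trt.nptype(engine.get_binding_dtype(i))
        shape = context.get_binding_shape(i)
        dev_mem = cuda.mem_alloc(int(np.prod(shape)) * np.dtype(dtype).itemsize)
        bindings.append(int(dev_mem))
        if engine.binding_is_input(i):
            cuda.memcpy_htod(dev_mem, np.ascontiguousarray(blob.astype(dtype)))
        else:
            outputs.append((np.empty(shape, dtype=dtype), dev_mem))
    context.execute_v2(bindings)                         # synchronous forward pass
    for host, dev_mem in outputs:
        cuda.memcpy_dtoh(host, dev_mem)                  # copy output tensor(s) back to host
    return [host for host, _ in outputs]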
Postprocessing: decode the output tensor obtained above. Taking object detection as an example, this usually involves anchor decoding (general_anchor), NMS/IoU filtering, and mapping coordinates back to the original image (transform_pred); classification models typically use get_max_pred; pose estimation models typically use keypoints_from_heatmap and transform_pred. A small NMS and coordinate-mapping sketch follows.
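A minimal sketch of the detection postprocessing steps named above: IoU-based NMS and undoing the letterbox transform (uses the ratio/offsets returned by the preprocessing sketch; thresholds are illustrative):

import numpy as np

def nms(boxes, scores, iou_thres=0.45):
    """boxes: (N, 4) in x1,y1,x2,y2; returns indices of kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou <= iou_thres]              # drop boxes that overlap too much
    return keep

def map_back(boxes, ratio, pad):
    """Undo the letterbox transform: subtract padding, divide by the resize ratio."""
    left, top = pad
    boxes = boxes.copy()
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - left) / ratio
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - top) / ratio
    return boxes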
Output: after postprocessing we have the final results, such as detections, classification labels, keypoints, or face coordinates. Depending on the scenario, these can drive application logic such as alarm pushing, or the alarm frames can be re-encoded (RGB -> YUV420) and pushed to a streaming media server as a video stream.
Generating an SDK
For platforms such as HiSilicon Hi3516/Hi3519 or RockChip RV1126/RV1109, flash space is small and cross-compilation is required; the algorithm can be packaged as a shared library exposing interface functions for the upper-level application to call. Boards such as RV3399, RK3568, and Jetson ship with Ubuntu or Linaro systems, so you can compile directly on the device, deploy Python, and use pybind11 to bridge between C++ and Python; a hypothetical example of the Python side of such a binding is sketched below.
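A hypothetical sketch of calling a pybind11-wrapped C++ detection SDK from Python on an RK3568/Jetson-class device; the module name "edge_sdk" and its API are illustrative only, not a real package:

import numpy as np
import edge_sdk  # assumed: a pybind11 extension module built from the C++ SDK

detector = edge_sdk.Detector("yolov5s.engine", conf_thres=0.25)  # assumed constructor
frame_rgb = np.zeros((1080, 1920, 3), dtype=np.uint8)            # placeholder RGB frame
detections = detector.infer(frame_rgb)  # assumed to return rows of [x1, y1, x2, y2, score, cls]
print(detections)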
Cloud / Server Side
For cloud deployment of models, the industry offers many open-source solutions, but so far none has truly unified the field or can be called the absolute mainstream choice.
Cloud deployment frameworks fall roughly into two categories. The first focuses on inference performance, i.e., raising inference speed: examples include TensorFlow Serving, NVIDIA Triton (formerly the TensorRT Inference Server, built on TensorRT), ONNX Runtime, and Paddle Serving in China. These convert the model into a specific format (possibly applying optimizations along the way) and expose it as a service, thereby achieving relatively high performance. A hedged Triton client sketch is shown below.
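A minimal sketch of querying a model served by NVIDIA Triton over HTTP, assuming the tritonclient package, a server on localhost:8000, and a model named "yolov5" whose input/output tensors are "images"/"output" (all of these names are assumptions):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
blob = np.zeros((1, 3, 640, 640), dtype=np.float32)       # preprocessed input placeholder

inp = httpclient.InferInput("images", list(blob.shape), "FP32")
inp.set_data_from_numpy(blob)
result = client.infer(model_name="yolov5", inputs=[inp])
output = result.as_numpy("output")                         # raw output tensor
print(output.shape)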
The second category focuses on managing deployment across the model's whole life cycle, for example MLflow, Seldon, BentoML, and Cortex. Their designs and philosophies vary widely: some integrate with the training side and also manage model artifacts; some only handle container orchestration, where you build the container yourself and they push it to Kubernetes (in which case they can even be combined with frameworks from the first category); and some focus purely on the model inference piece.
Closing Remarks
Deploying algorithms into production has become a key link in the AI pipeline. Given foreign product sanctions, we also strongly support landing AI on domestic intelligent hardware. We have deployed many algorithms on domestic chips from HiSilicon, RockChip, SigmaStar, Cambricon, Horizon, and others, including object detection (YOLOv5, etc.), face recognition (SCRFD + ArcFace, etc.), pose estimation (Lite-HRNet, etc.), action sequence recognition (TSM, etc.), and object tracking (MOT, ByteTrack). We have real-world datasets across many industries and domains and have built several AI products deployed in security, gas stations, charging stations, railway stations, shopping malls, and other settings. Follow-up posts will walk through the detailed deployment process for each major hardware platform and algorithm, with the goal of growing the domestic AI deployment community.
That's all for today. Thanks for reading, and follow us so you don't miss future posts.
智驅力 - Technology drives productivity