How Kubernetes Uses NVIDIA's k8s-device-plugin: An Analysis of Its Internals

1. Environment setup

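You need a Kubernetes cluster whose GPU nodes already have the NVIDIA driver installed, with the device plugin deployed. For installation, see https://github.com/NVIDIA/k8s-device-plugin

Running kubectl describe on a GPU node then shows that the device plugin has reported the GPU information into the node's Kubernetes status: nvidia.com/gpu appears under Capacity and Allocatable.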

A workload pod requests GPUs the same way it requests CPU: set containers[*].resources.limits.nvidia.com/gpu, as follows:

apiVersion: v1
kind: Pod
metadata:
  name: centos-1
  labels:
    app: centos
spec:
  containers:
  - name: centos-1
    image: centos
    imagePullPolicy: IfNotPresent
    command: ["top","-b"]
    resources:
      limits:
        nvidia.com/gpu: 1

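The following names can appear under resources:

nvidia.com/gpu: gives the Pod's container access to NVIDIA GPUs, and is the resource the plugin advertises by default. The Pod can only be scheduled onto a node that advertises enough nvidia.com/gpu; if no such node exists, the Pod stays Pending. Requesting the resource does not by itself inject nvidia-container-runtime into the Pod: the NVIDIA container runtime has to be installed on the node and configured as the runtime for the Pod (for example as the default runtime or through a RuntimeClass). During allocation the plugin passes the selected devices to that runtime (see section 3), and the runtime makes the GPUs, the driver libraries, and utilities such as nvidia-smi available inside the container.

Other resource names are sometimes listed alongside it. The upstream NVIDIA/k8s-device-plugin documentation does not list them among the resources it advertises, so check the Capacity section of kubectl describe node before using any of them:

nvidia.com/gpu-memory: described as the amount of GPU memory to allocate to the container, which would also cap how much GPU memory the container can use.

nvidia.com/gpu-p2p: described as an option for direct GPU-to-GPU data transfer without staging through the CPU, to improve throughput and reduce latency, for applications such as MPI that support it. It is often labelled "GPU Direct RDMA", although that term strictly refers to direct transfers between a GPU and another PCIe device such as a network adapter.

nvidia.com/gpu-instance-profile: described as selecting a GPU instance profile that defines the number of GPUs, the memory size, and the clock frequency available to the Pod, and as requiring NVIDIA CUDA Toolkit 11.2 or later plus an NVIDIA GPU Instance Profiles package on the nodes that run the device plugin.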

2. Source code analysis

The startup code is in cmd/nvidia-device-plugin/main.go#startPlugins:

func startPlugins(c *cli.Context, flags []cli.Flag, restarting bool) ([]plugin.Interface, bool, error) {
	......
	// Create the plugin manager.
	pluginManager, err := NewPluginManager(config)
	if err != nil {
		return nil, false, fmt.Errorf("error creating plugin manager: %v", err)
	}
	// Get the set of plugins, one per advertised resource.
	plugins, err := pluginManager.GetPlugins()
	if err != nil {
		return nil, false, fmt.Errorf("error getting plugins: %v", err)
	}
	.....
}

pluginManager.GetPlugins() -->> rm.NewNVMLResourceManagers(m.nvmllib, m.config) -->> NewDeviceMap(nvmllib, config)

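NewDeviceMap builds a DeviceMap, which records the devices available on the node, grouped by the resource name under which they will be advertised.

Specifically, NewDeviceMap() enumerates the node's GPUs through NVML (and, depending on the configured MIG strategy, their MIG devices) and assigns each one to a resource name according to the user's configuration. It does not itself create Kubernetes API objects or register anything with the cluster; the plugins built from this map report the devices to the kubelet later.

In the k8s-device-plugin source, DeviceMap is a central data structure: each resource manager, and therefore each plugin, is built from one of its entries. Building the DeviceMap is part of plugin startup and a prerequisite for the plugin to serve any devices.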

// NewDeviceMap creates a device map for the specified NVML library and config.
func NewDeviceMap(nvmllib nvml.Interface, config *spec.Config) (DeviceMap, error) {
	b := deviceMapBuilder{
		Interface: device.New(device.WithNvml(nvmllib)),
		config:    config,
	}
	return b.build()
}
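The grouping that the device map performs can be sketched as follows. This is a minimal illustration, not the plugin's code: the gpu struct, the buildDeviceMap function, and the rule that MIG slices are named nvidia.com/mig-<profile> are assumptions made for the example, and real enumeration goes through NVML rather than a hard-coded slice.

```go
package main

import (
	"fmt"
	"sort"
)

// gpu is a hypothetical stand-in for what NVML reports about one device.
type gpu struct {
	UUID    string
	MIG     bool   // true if this entry is a MIG slice rather than a full GPU
	Profile string // MIG profile name, only meaningful when MIG is true
}

// deviceMap groups device UUIDs by the resource name they are advertised under.
type deviceMap map[string][]string

// buildDeviceMap assigns every device to exactly one resource name.
// The naming rule here is an assumption of this sketch.
func buildDeviceMap(gpus []gpu) deviceMap {
	m := deviceMap{}
	for _, g := range gpus {
		name := "nvidia.com/gpu"
		if g.MIG {
			name = "nvidia.com/mig-" + g.Profile
		}
		m[name] = append(m[name], g.UUID)
	}
	return m
}

func main() {
	m := buildDeviceMap([]gpu{
		{UUID: "GPU-a"},
		{UUID: "MIG-b", MIG: true, Profile: "1g.5gb"},
		{UUID: "GPU-c"},
	})
	// Sort the resource names so the output order is stable.
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, m[name])
	}
}
```

Each key of such a map then yields one resource manager and one plugin, which is why GetPlugins returns a slice of plugins rather than a single one.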

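device.New(device.WithNvml(nvmllib)) creates a device interface backed by NVML (the NVIDIA Management Library). That object is assigned to the embedded Interface field of deviceMapBuilder, which uses it to enumerate the devices.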

pluginManager.GetPlugins() -->> plugin.NewNvidiaDevicePlugin(m.config, r, m.cdiHandler, m.cdiEnabled)

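NewNvidiaDevicePlugin creates a new NVIDIA device plugin object and initializes it so that it can cooperate with the kubelet in managing device resources. It:

takes a config parameter, the plugin configuration (command-line flags, resource definitions, and sharing settings);
takes a resourceManager parameter, the resource manager that owns the devices behind one resource name;
takes a cdiHandler parameter, the CDI (Container Device Interface) handler, which describes the GPU devices so that a CDI-aware runtime can inject them into containers;
takes a cdiEnabled parameter, which indicates whether CDI is enabled;
builds a new NvidiaDevicePlugin from these parameters, deriving the socket path from the resource name;
returns the newly created NvidiaDevicePlugin.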

// NewNvidiaDevicePlugin returns an initialized NvidiaDevicePlugin
func NewNvidiaDevicePlugin(config *spec.Config, resourceManager rm.ResourceManager, cdiHandler cdi.Interface, cdiEnabled bool) *NvidiaDevicePlugin {
	_, name := resourceManager.Resource().Split()

	deviceListStrategies, _ := spec.NewDeviceListStrategies(*config.Flags.Plugin.DeviceListStrategy)

	return &NvidiaDevicePlugin{
		rm:                   resourceManager,
		config:               config,
		deviceListEnvvar:     "NVIDIA_VISIBLE_DEVICES",
		deviceListStrategies: deviceListStrategies,
		socket:               pluginapi.DevicePluginPath + "nvidia-" + name + ".sock",
		cdiHandler:           cdiHandler,
		cdiEnabled:           cdiEnabled,
		cdiAnnotationPrefix:  *config.Flags.Plugin.CDIAnnotationPrefix,

		// These will be reinitialized every
		// time the plugin server is restarted.
		server: nil,
		health: nil,
		stop:   nil,
	}
}
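The socket field in the constructor is built from the resource name. The sketch below reproduces that string construction with the standard library only; splitResource and socketPath are hypothetical helper names, and devicePluginPath is a local copy of the value of pluginapi.DevicePluginPath in the kubelet device plugin API.

```go
package main

import (
	"fmt"
	"strings"
)

// devicePluginPath is a local copy of pluginapi.DevicePluginPath,
// the directory the kubelet watches for device plugin sockets.
const devicePluginPath = "/var/lib/kubelet/device-plugins/"

// splitResource splits "nvidia.com/gpu" into ("nvidia.com", "gpu").
// A name without a slash is returned with an empty prefix.
func splitResource(resource string) (prefix, name string) {
	prefix, name, found := strings.Cut(resource, "/")
	if !found {
		return "", resource
	}
	return prefix, name
}

// socketPath mirrors the expression used for the socket field above.
func socketPath(resource string) string {
	_, name := splitResource(resource)
	return devicePluginPath + "nvidia-" + name + ".sock"
}

func main() {
	fmt.Println(socketPath("nvidia.com/gpu"))
	// prints /var/lib/kubelet/device-plugins/nvidia-gpu.sock
}
```

Because the socket name depends on the resource name, every advertised resource gets its own socket and its own gRPC server.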

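Kubernetes discovers the GPU resources mainly through func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer).

In the source, NvidiaDevicePlugin implements the pluginapi.DevicePluginServer interface, and the kubelet on the node calls it over gRPC through the plugin's Unix socket. The direction of registration matters: when the plugin starts, it registers itself with the kubelet through plugin.Register, telling the kubelet its resource name and socket. The kubelet then calls ListAndWatch to obtain the device list and keeps the stream open to receive changes, which is how the devices end up in the node status and can be managed and allocated.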

// ListAndWatch lists devices and update that list according to the health status
func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

	for {
		select {
		case <-plugin.stop:
			return nil
		case d := <-plugin.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			d.Health = pluginapi.Unhealthy
			klog.Infof("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
			s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
		}
	}
}
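The control flow of ListAndWatch (send the full list once, then resend it whenever a device turns unhealthy, until stopped) can be tried out without gRPC. The sketch below is a simplified model: the device type, the send callback that stands in for the gRPC stream, and runDemo are all hypothetical names introduced for the example.

```go
package main

import "fmt"

// device is a hypothetical, simplified stand-in for pluginapi.Device.
type device struct {
	ID      string
	Healthy bool
}

// snapshot copies the current device states so later changes do not alter it.
func snapshot(devices []*device) []device {
	out := make([]device, 0, len(devices))
	for _, d := range devices {
		out = append(out, *d)
	}
	return out
}

// listAndWatch sends the full device list once, then again each time a
// device is reported on the health channel, until stop is closed.
func listAndWatch(devices []*device, health <-chan *device, stop <-chan struct{}, send func([]device)) {
	send(snapshot(devices))
	for {
		select {
		case <-stop:
			return
		case d := <-health:
			d.Healthy = false
			send(snapshot(devices))
		}
	}
}

// runDemo marks one of two devices unhealthy and returns every list sent.
func runDemo() [][]device {
	devs := []*device{{ID: "GPU-0", Healthy: true}, {ID: "GPU-1", Healthy: true}}
	health := make(chan *device) // unbuffered: a send returns once the loop has received it
	stop := make(chan struct{})
	done := make(chan struct{})
	var updates [][]device
	go func() {
		defer close(done)
		listAndWatch(devs, health, stop, func(l []device) { updates = append(updates, l) })
	}()
	health <- devs[1]
	close(stop)
	<-done // wait for the loop to exit before reading updates
	return updates
}

func main() {
	for i, u := range runDemo() {
		fmt.Println("update", i, u)
	}
}
```

As in the original, the whole list is resent on every change rather than a delta, and nothing ever moves a device back to healthy.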

The corresponding gRPC service registration can be found in internal/plugin/server.go:

// Serve starts the gRPC server of the device plugin.
func (plugin *NvidiaDevicePlugin) Serve() error {
	os.Remove(plugin.socket)
	sock, err := net.Listen("unix", plugin.socket)
	if err != nil {
		return err
	}

	pluginapi.RegisterDevicePluginServer(plugin.server, plugin)
	.......
}
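Serve removes the socket path before listening because a file left behind by a previous run would make the bind fail. The following standard-library sketch shows that step in isolation; listenFresh and demoListen are hypothetical names, and a temporary directory stands in for the kubelet's device plugin directory.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// listenFresh removes whatever is left at path and then listens on it.
// Unlike the excerpt above, it reports removal errors other than "not found".
func listenFresh(path string) (net.Listener, error) {
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", path)
}

// demoListen plants a stale file at the socket path, then listens on it.
// It returns the network type of the resulting listener.
func demoListen() (string, error) {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "nvidia-gpu.sock")
	// Simulate a leftover from a previous run of the plugin.
	if err := os.WriteFile(path, nil, 0o600); err != nil {
		return "", err
	}

	l, err := listenFresh(path)
	if err != nil {
		return "", err
	}
	defer l.Close()
	return l.Addr().Network(), nil
}

func main() {
	network, err := demoListen()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("listening on a", network, "socket")
}
```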

3. Resource allocation

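With k8s-device-plugin, the GPUs in a cluster can be shared through time-slicing. Before looking at time-slicing, look at Allocate. The kubelet calls it when a container that has been granted GPU devices is about to be created, and it returns what the container runtime needs in order to expose those devices.

The method iterates over the container requests in the reqs parameter. For each request it optionally rejects a request for more than one replica of a shared resource, checks that every requested device ID is known to the resource manager, and calls getAllocateResponse to build the response for that container. The responses are collected in an AllocateResponse; if any step fails, an error is returned instead. Allocate does not choose devices or perform time-slicing itself: the kubelet has already picked the device IDs, and when time-slicing is enabled those IDs are annotated replica IDs that refer to the same physical GPU.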

// Allocate which return list of devices.
func (plugin *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	responses := pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		// If the devices being allocated are replicas, then (conditionally)
		// error out if more than one resource is being allocated.
		if plugin.config.Sharing.TimeSlicing.FailRequestsGreaterThanOne && rm.AnnotatedIDs(req.DevicesIDs).AnyHasAnnotations() {
			if len(req.DevicesIDs) > 1 {
				return nil, fmt.Errorf("request for '%v: %v' too large: maximum request size for shared resources is 1", plugin.rm.Resource(), len(req.DevicesIDs))
			}
		}

		for _, id := range req.DevicesIDs {
			if !plugin.rm.Devices().Contains(id) {
				return nil, fmt.Errorf("invalid allocation request for '%s': unknown device: %s", plugin.rm.Resource(), id)
			}
		}

		response, err := plugin.getAllocateResponse(req.DevicesIDs)
		if err != nil {
			return nil, fmt.Errorf("failed to get allocate response: %v", err)
		}
		responses.ContainerResponses = append(responses.ContainerResponses, response)
	}

	return &responses, nil
}
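The two checks Allocate performs per container can be isolated in a small sketch. It assumes that an annotated (replica) ID has the form "<device-id>::<replica>"; hasAnnotation, validateRequest, and the known set are hypothetical names standing in for rm.AnnotatedIDs and plugin.rm.Devices().

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnnotation reports whether id looks like a replica ID.
// The "<device-id>::<replica>" form is an assumption of this sketch.
func hasAnnotation(id string) bool {
	return strings.Contains(id, "::")
}

// validateRequest applies the same two checks as Allocate, in the same order:
// an optional size limit for shared devices, then a lookup of every ID.
func validateRequest(ids []string, known map[string]bool, failGreaterThanOne bool) error {
	anyAnnotated := false
	for _, id := range ids {
		if hasAnnotation(id) {
			anyAnnotated = true
			break
		}
	}
	if failGreaterThanOne && anyAnnotated && len(ids) > 1 {
		return fmt.Errorf("request too large: %d shared devices requested, maximum is 1", len(ids))
	}
	for _, id := range ids {
		if !known[id] {
			return fmt.Errorf("unknown device: %s", id)
		}
	}
	return nil
}

func main() {
	known := map[string]bool{"GPU-a::0": true, "GPU-a::1": true}

	fmt.Println(validateRequest([]string{"GPU-a::0"}, known, true))
	fmt.Println(validateRequest([]string{"GPU-a::0", "GPU-a::1"}, known, true))
	fmt.Println(validateRequest([]string{"GPU-x::0"}, known, true))
}
```

The size limit exists because two replicas of a time-sliced GPU may be the same physical device, so requesting two of them does not give a container twice the capacity.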

plugin.getAllocateResponse(req.DevicesIDs)

func (plugin *NvidiaDevicePlugin) getAllocateResponse(requestIds []string) (*pluginapi.ContainerAllocateResponse, error) {
	// Convert the requested IDs into plain device IDs. With time-slicing the
	// requested IDs are annotated replica IDs, so the annotation has to be
	// stripped to get back to the underlying devices.
	deviceIDs := plugin.deviceIDsFromAnnotatedDeviceIDs(requestIds)

	// Build the base response through the CDI path, keyed by a fresh UUID.
	// No time-slicing is performed here.
	responseID := uuid.New().String()
	response, err := plugin.getAllocateResponseForCDI(responseID, deviceIDs)
	if err != nil {
		return nil, fmt.Errorf("failed to get allocate response for CDI: %v", err)
	}
	// Envvar strategy: pass the device IDs in NVIDIA_VISIBLE_DEVICES.
	if plugin.deviceListStrategies.Includes(spec.DeviceListStrategyEnvvar) {
		response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, deviceIDs)
	}
	// Volume-mounts strategy: point NVIDIA_VISIBLE_DEVICES at a directory in
	// the container and pass the device IDs as mounts under it.
	if plugin.deviceListStrategies.Includes(spec.DeviceListStrategyVolumeMounts) {
		response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, []string{deviceListAsVolumeMountsContainerPathRoot})
		response.Mounts = plugin.apiMounts(deviceIDs)
	}
	// If PassDeviceSpecs is set, also return explicit device specs for the
	// allocated devices.
	if *plugin.config.Flags.Plugin.PassDeviceSpecs {
		response.Devices = plugin.apiDeviceSpecs(*plugin.config.Flags.NvidiaDriverRoot, requestIds)
	}
	// Optional features are signalled to the runtime through extra
	// environment variables.
	if *plugin.config.Flags.GDSEnabled {
		response.Envs["NVIDIA_GDS"] = "enabled"
	}
	if *plugin.config.Flags.MOFEDEnabled {
		response.Envs["NVIDIA_MOFED"] = "enabled"
	}

	return &response, nil
}
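For the envvar strategy, the net effect of getAllocateResponse is a single environment variable listing the allocated devices. The sketch below shows that path under two assumptions: replica IDs have the form "<device-id>::<replica>", and the IDs are joined with commas. stripAnnotations and buildEnvs are hypothetical stand-ins for deviceIDsFromAnnotatedDeviceIDs and apiEnvs, not their actual implementations.

```go
package main

import (
	"fmt"
	"strings"
)

// stripAnnotations removes an assumed "::<replica>" suffix from each ID.
// IDs without the suffix are returned unchanged.
func stripAnnotations(ids []string) []string {
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		base, _, _ := strings.Cut(id, "::")
		out = append(out, base)
	}
	return out
}

// buildEnvs returns the environment for the container: one variable that
// holds the comma-separated device IDs.
func buildEnvs(envvar string, deviceIDs []string) map[string]string {
	return map[string]string{envvar: strings.Join(deviceIDs, ",")}
}

func main() {
	requested := []string{"GPU-a::0", "GPU-b"}
	envs := buildEnvs("NVIDIA_VISIBLE_DEVICES", stripAnnotations(requested))
	fmt.Println(envs["NVIDIA_VISIBLE_DEVICES"])
	// prints GPU-a,GPU-b
}
```

The NVIDIA container runtime on the node reads this variable when it creates the container, which is what connects the plugin's allocation to the devices the container can see.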
