Trying Out Alibaba Cloud Container Service for Kubernetes 1.16: A Shared-GPU TensorFlow Lab

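Introduction

TensorFlow is among the most popular open-source frameworks for deep learning and machine learning. It was originally developed by a Google research team for machine learning research on deep neural networks, and it has been widely adopted since it was open-sourced in 2015. TensorBoard in particular is a powerful tool that helps data scientists work efficiently.

Jupyter Notebook is a powerful data analysis tool. It supports rapid development and makes machine learning code easy to share, so data science teams rely on it for experiments and for collaboration within a group. It is also a good starting point for newcomers to machine learning.

Many data scientists prefer to develop TensorFlow code in Jupyter. However, building such an environment from scratch, configuring GPU access, and supporting the latest TensorFlow version is both complicated and a waste of a data scientist's effort.

On a Kubernetes cluster, you can quickly deploy a complete Jupyter Notebook environment for model development. The only problem with this approach is that each GPU is allocated exclusively, which wastes a lot of capacity. Notebook experiments usually need little GPU memory, so letting several users share one GPU would lower the cost of model development.

The Alibaba Cloud Container Service team has released a GPU sharing solution that greatly improves GPU utilization in model development and model inference scenarios while still keeping GPU resources isolated between workloads.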

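The exclusive-GPU approach

First, let us review how GPUs were scheduled before.

Add a new GPU node to the cluster

Create a Container Service cluster
Add a GPU node as a worker

In this example we choose the GPU instance type "ecs.gn6i-c4g1.xlarge".

After the node is added, it appears as "cn-zhangjiakou.192.168.3.189":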

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get node -L cgpu,workload_type
NAME                           STATUS   ROLES    AGE     VERSION            CGPU   WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189   Ready    <none>   5m52s   v1.16.6-aliyun.1

Deploy the application

Deploy the application with the following command:

kubectl apply -f gpu_deployment.yaml

The contents of gpu_deployment.yaml are as follows:

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-gpu
  labels:
    app: tf-notebook-gpu
spec:
  replicas: 2
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-gpu
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-gpu
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-gpu
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-gpu
  type: LoadBalancer

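Because there is only one GPU node and the YAML file above requests two Pods, the Pods are scheduled as shown below.

The second Pod stays in the Pending state because no node has a free GPU left for it. In other words, a single Pod "monopolizes" the node's GPU.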

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
tf-notebook-2-7b4d68d8f7-mb852     1/1     Running   0          15h
tf-notebook-3-86c48d4c7d-flz7m     1/1     Running   0          15h
tf-notebook-7cf4575d78-sxmfl       1/1     Running   0          23h
tf-notebook-gpu-695cb6cf89-dsjmv   1/1     Running   0          6s
tf-notebook-gpu-695cb6cf89-mwm98   0/1     Pending   0          6s
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl describe pod tf-notebook-gpu-695cb6cf89-mwm98
Name:           tf-notebook-gpu-695cb6cf89-mwm98
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=tf-notebook-gpu
                pod-template-hash=695cb6cf89
Annotations:    kubernetes.io/psp: ack.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/tf-notebook-gpu-695cb6cf89
Containers:
  tf-notebook:
    Image:      tensorflow/tensorflow:1.4.1-gpu-py3
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      PASSWORD:  mypassw0rd
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wpwn8 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-wpwn8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wpwn8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   <unknown>          default-scheduler   0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   <unknown>          default-scheduler   0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
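
The Pending state follows directly from GPUs being counted as whole units. The sketch below is a deliberately simplified first-fit model, not the real Kubernetes scheduler; the function name and the node and Pod names are hypothetical and chosen only for illustration.

```python
def schedule_whole_gpus(free_gpus_per_node, pod_requests):
    """Place pods that each request whole GPUs, first fit; None means Pending."""
    free = dict(free_gpus_per_node)  # copy so the caller's dict is untouched
    placement = {}
    for pod, wanted in pod_requests.items():
        placement[pod] = None
        for node, available in free.items():
            if available >= wanted:
                free[node] = available - wanted
                placement[pod] = node
                break
    return placement


# One GPU node with a single GPU, two Pods that each ask for nvidia.com/gpu: 1.
nodes = {"cpu-node": 0, "gpu-node": 1}
pods = {"tf-notebook-gpu-a": 1, "tf-notebook-gpu-b": 1}
result = schedule_whole_gpus(nodes, pods)
print(result)
```

The first Pod takes the only GPU, so the second one has nowhere to go, which mirrors the "Insufficient nvidia.com/gpu" event above.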

A real program

Run the following program in Jupyter:

import argparse

import tensorflow as tf

FLAGS = None

def train(fraction=1.0):
    # Cap the share of GPU memory this process may allocate.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction

    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Create a session that uses the memory-capped config.
    sess = tf.Session(config=config)
    # Run the op in an endless loop to keep the GPU busy.
    while True:
        sess.run(c)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                      help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                      help='Allocated GPU memory.')
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round( FLAGS.allocated * 0.7 / FLAGS.total , 1 )

    print(fraction)  # fraction defaults to 0.7, so the program uses at most 70% of the total resources
    train(fraction)

The managed Prometheus service shows that, while running, the program uses 70% of the whole machine's GPU resources.
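
The 70% figure comes from the way the program derives its memory fraction. The sketch below isolates that arithmetic so it can be checked without a GPU; the helper name memory_fraction and the headroom parameter are introduced here for illustration and do not appear in the original program.

```python
def memory_fraction(allocated, total, headroom=0.7):
    """Same formula as the program: round(allocated * 0.7 / total, 1)."""
    return round(allocated * headroom / total, 1)


# With the default arguments (--allocated 1000 --total 1000) the result is 0.7.
default_fraction = memory_fraction(1000, 1000)
print(default_fraction)

# Hypothetical non-default arguments: 4 allocated out of a total of 14.
print(memory_fraction(4, 14))
```

Because the result is rounded to one decimal place, small changes in the arguments can map to the same fraction.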


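Problems with the exclusive-GPU approach

In summary, the problem with exclusive GPU scheduling is that in scenarios with light GPU usage, such as inference and teaching, multiple Pods cannot be scheduled onto the same GPU to share it.

To solve this problem we introduce a GPU sharing solution that makes better use of GPU resources and provides denser deployment, higher GPU utilization, and full isolation.

The GPU sharing solution

Environment preparation

Prerequisites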

Item	Supported version
Kubernetes	1.16.6; dedicated cluster, with master nodes inside the customer's VPC
Helm	3.0 or later
NVIDIA driver	418.87.01 or later
Docker	19.03.5
Operating system	CentOS 7.6, CentOS 7.7, Ubuntu 16.04, and Ubuntu 18.04
Supported GPUs	Tesla P4, Tesla P100, Tesla T4, and Tesla V100 (16 GB)

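Create a cluster

Add a GPU node

The GPU node instance type used in this article is ecs.gn6i-c4g1.xlarge.

Make the node a GPU sharing node by labeling it

Log on to the Container Service console.
In the left-side navigation pane, choose Clusters > Nodes.
On the Nodes page, select the target cluster and click Manage Labels in the upper-right corner.
On the Manage Labels page, select the nodes in batch, then click Add Label.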

https://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/zh-CN/8783600951/p101294.png

In the Add dialog box that appears, enter the label name and value.

Note: make sure the name is set to cgpu and the value is set to true.

https://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/zh-CN/8783600951/p101645.png

Click OK.
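
The console steps above set an ordinary Kubernetes node label, so the same label can also be applied with kubectl label. The sketch below only builds the command line and does not execute it; the helper name is hypothetical, and the node name is taken from the arena output later in this article.

```python
def label_node_command(node_name, key="cgpu", value="true"):
    """Build a `kubectl label node` command that sets key=value on one node."""
    # --overwrite lets the command succeed even if the label already exists.
    return ["kubectl", "label", "node", node_name, f"{key}={value}", "--overwrite"]


cmd = label_node_command("cn-zhangjiakou.192.168.3.184")
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run would apply the label, provided kubectl is configured for the target cluster.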

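Install the cGPU component in the cluster

In the left-side navigation pane, choose Marketplace > App Catalog.
On the App Catalog page, find and click ack-cgpu.
In the creation panel on the right side of the App Catalog - ack-cgpu page, select the target cluster and click Create. You do not need to set the namespace or the release name; the system fills in default values.

https://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/zh-CN/9783600951/p101994.png

Run helm get manifest cgpu -n kube-system | kubectl get -f - to check whether the cGPU component was installed successfully. Output like the following indicates a successful installation.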

# helm get manifest cgpu -n kube-system | kubectl get -f -
NAME                                    SECRETS   AGE
serviceaccount/gpushare-device-plugin   1         39s
serviceaccount/gpushare-schd-extender   1         39s
NAME                                                           AGE
clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s
NAME                                                                  AGE
clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s
NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s
NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
daemonset.apps/device-plugin-evict-ds      4         4         4       4            4           cgpu=true        39s
daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false   39s
daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpushare-schd-extender   1/1     1            1           38s
NAME                           COMPLETIONS   DURATION   AGE
job.batch/gpushare-installer   3/1 of 3      3s         38s
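
One way to read this listing is to compare the DESIRED and READY columns of each DaemonSet. The sketch below does that on a pasted copy of the output; it is a plain text parser written for this article's sample, with a hypothetical function name, and it does not call helm or kubectl itself.

```python
def daemonsets_ready(listing):
    """Map each 'daemonset.apps/...' row to True when DESIRED equals READY."""
    status = {}
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("daemonset.apps/"):
            # Columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
            desired, ready = int(fields[1]), int(fields[3])
            status[fields[0]] = desired == ready
    return status


sample = """\
NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false       39s
daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
"""
status = daemonsets_ready(sample)
print(all(status.values()), status)
```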

Install arena and check resources

Install arena

@ linux

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
sh ./arena-installer/install.sh

@ mac

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
sh ./arena-installer/install.sh

Check resources

jumper(⎈ |zjk-gpu:default)➜  ~ arena top node
NAME                          IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)  GPU(Shareable)
cn-zhangjiakou.192.168.0.138  192.168.0.138  master  ready   0           0               No
cn-zhangjiakou.192.168.1.112  192.168.1.112  master  ready   0           0               No
cn-zhangjiakou.192.168.1.113  192.168.1.113  <none>  ready   0           0               No
cn-zhangjiakou.192.168.3.115  192.168.3.115  master  ready   0           0               No
cn-zhangjiakou.192.168.3.184  192.168.3.184  <none>  ready   1           0               Yes
------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s
NAME                          IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184  192.168.3.184  0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
0/14 (GiB) (0%)

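As shown above, node cn-zhangjiakou.192.168.3.184 has 1 GPU and is reported as GPU(Shareable), which means the node carries the label cgpu=true. It exposes 14 GiB of GPU memory.

Run the TensorFlow GPU lab environment

Save the following file as mem_deployment.yaml and apply it with kubectl: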

kubectl apply -f mem_deployment.yaml

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment.yaml
deployment.apps/tf-notebook created
service/tf-notebook created
jumper(⎈ |zjk-gpu:default)➜  ~  kubectl get svc tf-notebook
NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   78m

Open http://${EXTERNAL-IP}/ to access the notebook.
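
The EXTERNAL-IP value can be read straight from the kubectl get svc listing. The sketch below extracts it from a pasted copy of that output and builds the notebook URL; the function name is hypothetical, and the parser assumes the column layout shown above.

```python
def notebook_url(svc_listing, service_name):
    """Return http://<EXTERNAL-IP>/ for the service, or None if no IP is assigned."""
    for line in svc_listing.splitlines():
        fields = line.split()
        if fields and fields[0] == service_name:
            external_ip = fields[3]  # NAME TYPE CLUSTER-IP EXTERNAL-IP ...
            # kubectl prints placeholders such as <none> or <pending> in angle brackets.
            return None if external_ip.startswith("<") else f"http://{external_ip}/"
    return None


listing = """\
NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   78m
"""
url = notebook_url(listing, "tf-notebook")
print(url)
```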

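Deployment configuration:

aliyun.com/gpu-mem sets the amount of GPU memory the Pod requests, 4 here (arena reports this resource in GiB). In the exclusive deployment shown earlier, nvidia.com/gpu set the number of NVIDIA GPUs instead.

type=LoadBalancer exposes the in-cluster service through an Alibaba Cloud load balancer.

The environment variable PASSWORD sets the password for the Jupyter service. The default is "mypassw0rd"; change it as needed.

To verify that this Jupyter instance can use the GPU, run the following program. It lists all devices available to TensorFlow.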

from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())

The output includes the device GPU:0.
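
To check for a GPU programmatically rather than by eye, the returned names can be filtered. The sketch below works on a plain list of strings, so it runs without TensorFlow; the sample names are an assumption about the format and may differ between TensorFlow versions.

```python
def gpu_devices(device_names):
    """Keep only the device names that refer to a GPU."""
    return [name for name in device_names if "gpu" in name.lower()]


# Assumed example of what get_available_devices() might return.
sample_names = ["/device:CPU:0", "/device:GPU:0"]
gpus = gpu_devices(sample_names)
print(gpus)
```

In the notebook, the same filter could be applied to the result of get_available_devices().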

Open a new terminal from the Jupyter home page and run:

nvidia-smi

The GPU memory limit reported inside the Pod is 4308MiB.

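Verify GPU sharing

The section above shows that the new resource "aliyun.com/gpu-mem: 4" successfully requests GPU resources and runs a GPU task. Next, let us look at how the GPU is shared.

Check resource usage

First, the current resource usage is as follows: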

arena top node -s -d

jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s -d

NAME:       cn-zhangjiakou.192.168.3.184
IPADDRESS:  192.168.3.184

NAME                            NAMESPACE  GPU0(Allocated)
tf-notebook-2-7b4d68d8f7-wxlff  default    4
tf-notebook-3-86c48d4c7d-lk9h8  default    4
tf-notebook-7cf4575d78-9gxzd    default    4
Allocated :                     12 (85%)
Total :                         14
--------------------------------------------------------------------------------------------------------------------------------------


Allocated/Total GPU Memory In GPUShare Node:
12/14 (GiB) (85%)

As shown above, the node has 14 GiB of GPU memory, so three Pods that each request 4 GiB can be scheduled onto it.
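
The numbers in the arena output can be reproduced with simple integer arithmetic. This is a sketch with hypothetical helper names; truncating the percentage is an assumption made to match the 85% that arena displays.

```python
def pods_that_fit(total_gib, per_pod_gib):
    """How many pods with a fixed memory request fit on one shared GPU."""
    return total_gib // per_pod_gib


def allocated_percent(allocated_gib, total_gib):
    """Allocation as a whole-number percentage, truncated rather than rounded."""
    return int(100 * allocated_gib / total_gib)


pods = pods_that_fit(14, 4)
used = pods * 4
print(pods, used, allocated_percent(used, 14))  # 3 12 85
```

The remaining 2 GiB are too small for another 4 GiB request, but a Pod requesting 2 GiB or less could still be placed.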

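Deploy more services and replicas

So that each notebook has its own entry point, we create three Services that point at three Pods. The YAML files are as follows.

Note: mem_deployment-2.yaml and mem_deployment-3.yaml are almost identical to mem_deployment.yaml; each simply points a different Service at a different Pod.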

mem_deployment-2.yaml

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-2
  labels:
    app: tf-notebook-2
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-2
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-2
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-2
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-2
  type: LoadBalancer

mem_deployment-3.yaml

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-3
  labels:
    app: tf-notebook-3
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-3
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-3
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-3
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-3
  type: LoadBalancer
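
Because the three manifests differ only in their names, they could also be generated from one template. The sketch below builds the same Deployment and Service as a Python dictionary and prints it as JSON; the function name and its parameters are hypothetical, while the field values are copied from the YAML above.

```python
import json


def notebook_manifests(name, gpu_mem=4, password="mypassw0rd"):
    """Build the Deployment and Service for one notebook as a v1 List."""
    labels = {"app": name}
    container = {
        "name": "tf-notebook",
        "image": "tensorflow/tensorflow:1.4.1-gpu-py3",
        "resources": {
            "limits": {"aliyun.com/gpu-mem": gpu_mem},
            "requests": {"aliyun.com/gpu-mem": gpu_mem},
        },
        "ports": [{"containerPort": 8888}],
        "env": [{"name": "PASSWORD", "value": password}],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": labels,
            "ports": [{"name": "jupyter", "port": 80, "targetPort": 8888}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}


manifest = notebook_manifests("tf-notebook-2")
print(json.dumps(manifest, indent=2))
```

kubectl apply -f accepts JSON as well as YAML, so the printed output can be saved to a file and applied in the same way as the YAML files.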

Apply the two YAML files. Together with the Pod and Service deployed earlier, the cluster now runs 3 Pods and 3 Services.

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment-2.yaml
deployment.apps/tf-notebook-2 created
service/tf-notebook-2 created
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment-3.yaml
deployment.apps/tf-notebook-3 created
service/tf-notebook-3 created
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get svc
NAME            TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
kubernetes      ClusterIP      172.21.0.1    <none>          443/TCP        11d
tf-notebook     LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   7h48m
tf-notebook-2   LoadBalancer   172.21.1.46   39.99.218.255   80:30659/TCP   8m53s
tf-notebook-3   LoadBalancer   172.21.8.56   39.98.242.180   80:31274/TCP   7s
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-2-7b4d68d8f7-mb852   1/1     Running   0          9m6s    172.20.64.21   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-3-86c48d4c7d-flz7m   1/1     Running   0          20s     172.20.64.22   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-sxmfl     1/1     Running   0          7h49m   172.20.64.14   cn-zhangjiakou.192.168.3.184   <none>           <none>
jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s
NAME                          IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184  192.168.3.184  12/14
----------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
12/14 (GiB) (85%)

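Check the final result

Running

kubectl get pod -o wide

shows 3 Pods running on node cn-zhangjiakou.192.168.3.184, and

arena top node -s

shows that 12 of the 14 GiB of GPU memory on that node are allocated.

Open a terminal in each service and check the GPU with nvidia-smi: the limit reported in every Pod is 4308MiB.

Run the following command on node cn-zhangjiakou.192.168.3.184 to check the resources on the node itself: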

[root@iZ8vb4lox93w3mhkqmdrgsZ ~]# nvidia-smi
Wed May 27 12:19:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   49C    P0    29W /  70W |   4019MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11563      C   /usr/bin/python3                            4009MiB |
+-----------------------------------------------------------------------------+
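
When comparing many nodes or Pods, the memory figures can be pulled out of the nvidia-smi text with a regular expression. The sketch below parses the "used / total" pair from a pasted line; it is a hypothetical helper written for this output layout, not part of nvidia-smi.

```python
import re


def parse_memory_usage(nvidia_smi_text):
    """Return (used_mib, total_mib) from the first 'NNNMiB / NNNMiB' pair, or None."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "| N/A   49C    P0    29W /  70W |   4019MiB / 15079MiB |      0%      Default |"
used_mib, total_mib = parse_memory_usage(line)
print(used_mib, total_mib)
```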

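This shows that cGPU mode allows more GPU-consuming Pods to be deployed on the same node, whereas under ordinary scheduling a node with one GPU can host only one such Pod.

Below is a program that keeps using the GPU continuously. Its fraction parameter is the share of the available GPU memory to request, 0.7 by default. We run it in Jupyter in all 3 Pods.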

import argparse

import tensorflow as tf

FLAGS = None

def train(fraction=1.0):
    # Cap the share of GPU memory this process may allocate.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction

    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Create a session that uses the memory-capped config.
    sess = tf.Session(config=config)
    # Run the op in an endless loop to keep the GPU busy.
    while True:
        sess.run(c)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                      help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                      help='Allocated GPU memory.')
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round( FLAGS.allocated * 0.7 / FLAGS.total , 1 )

    print(fraction)  # fraction defaults to 0.7, so the program uses at most 70% of the total resources
    train(fraction)

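Then observe the actual resource usage in the managed Prometheus service.

According to the dashboard, each Pod actually uses 3.266 GB of GPU memory, so the GPU memory used by every Pod stays within its limit of 4.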

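Summary

To summarize:

Adding the label cgpu: true to a node makes it a GPU sharing node.
In a Pod, a resource such as aliyun.com/gpu-mem: 4 requests and limits the GPU memory a single Pod may use, which is what makes sharing possible. Each Pod still gets full GPU capability, and the single GPU on the node is shared by 3 Pods, three times as many as exclusive scheduling allows. Splitting the memory into smaller shares would allow even more.
arena top node and arena top node -s show how GPU resources are allocated.
The "GPU APP" dashboard of the managed Prometheus service shows GPU memory usage, GPU utilization, temperature, power, and other metrics at run time.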

References

Managed Prometheus

https://help.aliyun.com/document_detail/122123.html

GPU sharing solution cGPU

https://help.aliyun.com/document_detail/163994.html

arena

https://github.com/kubeflow/arena