
Node Problem Detector: What is NPD / Detected Problems / Why We Need NPD / NPD Components / NPD Architecture / Deployment / Testing / Thanks

Have a nice day! ❤️

What is NPD

NPD (Node Problem Detector) is an open-source project that monitors node health and detects common node problems, such as hardware, kernel, or container runtime issues.

Detected Problems

  • Infrastructure problems, e.g. ntp is down
  • Container runtime problems, e.g. unresponsive runtime daemons
  • Hardware problems, e.g. CPU, memory, or disk failures
  • Kernel problems, e.g. kernel deadlocks or file system corruption

Why We Need NPD

Kubernetes cannot see the problems listed above. When they occur, k8s still schedules pods onto the "problematic" nodes (this is not really k8s's fault: it does not see the problems, so it considers the nodes schedulable). NPD exists to report these problems to k8s, so that k8s can see and feel that a node has gone wrong and exclude it during scheduling.

NPD Components

Problem Daemon and Exporter are both subprograms of the NPD program.

Problem Daemon

A set of daemons that monitor node problems:
  • SystemLogMonitor: monitors system logs and reports problems and metrics according to predefined rules
  • SystemStatsMonitor: a system stats monitor for node problem detection; collects various health-related system statistics as metrics
  • CustomPluginMonitor: invokes user-defined check scripts to detect various node problems
  • HealthChecker: checks the health of the kubelet and the container runtime
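For CustomPluginMonitor, a plugin is simply an executable: NPD maps its exit code to a result (0 = OK, 1 = NonOK, 2 = Unknown) and uses its stdout as the condition message. A minimal sketch of an ntp check follows; the script name `check_ntp.sh` and the chronyd/ntpd process names are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical check_ntp.sh for CustomPluginMonitor.
# NPD plugin convention: exit 0 = OK, 1 = NonOK (problem), 2 = Unknown.
check_ntp() {
  if pgrep -x chronyd >/dev/null 2>&1 || pgrep -x ntpd >/dev/null 2>&1; then
    echo "ntp service is functioning properly"
    return 0
  fi
  echo "ntp service is down"
  return 1
}

check_ntp
echo "plugin exit code: $?"
```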

Exporter

Reports node problems and metrics to backends:
  • Kubernetes exporter (Event, NodeCondition): reports node problems to the k8s API; temporary problems are reported as Events, permanent problems as NodeConditions
  • Prometheus exporter: converts problems and metrics into a Prometheus-compatible format for Prometheus to scrape
  • Stackdriver exporter: sends problems and metrics to the Stackdriver Monitoring API (a Google service that integrates monitoring, logging, and tracing)
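To make the Prometheus mapping concrete, here is a rough Python sketch of how a permanent problem (a NodeCondition) could be rendered as a `problem_gauge` sample. The function and sample data are illustrative, not NPD's actual code; only the metric name matches what NPD's Prometheus exporter exposes.

```python
# Illustrative only: render NodeCondition-style problems as problem_gauge
# samples, 1 when the condition is active and 0 otherwise.
def render_problem_gauge(conditions):
    lines = []
    for c in conditions:
        value = 1 if c["status"] == "True" else 0
        lines.append('problem_gauge{reason="%s",type="%s"} %d'
                     % (c["reason"], c["type"], value))
    return "\n".join(lines)

sample = [
    {"type": "KubeletUnhealthy", "status": "True", "reason": "KubeletUnhealthy"},
    {"type": "KernelDeadlock", "status": "False", "reason": "KernelHasNoDeadlock"},
]
print(render_problem_gauge(sample))
```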

NPD Architecture

(Architecture diagram not reproduced here; see the Node Problem Detector repository for the architecture overview.)

Deployment

One-click deployment with helm is nice, but I prefer doing things by hand, so I chose to deploy with YAML files. The problem is that the deployment files in the project have not been updated in a long time: they lack even basics like a ServiceAccount and RBAC, so simply tweaking them and running leads to all sorts of permission and other errors. Write all the YAML myself? Or is there a simpler way, such as rendering the helm templates into YAML?

### The helm template subcommand does the job; adjust values.yaml to your needs
### https://github.com/deliveryhero/helm-charts/tree/master/stable/node-problem-detector
# helm template ./  --output-dir /tmp/npd
wrote /tmp/npd/node-problem-detector/templates/serviceaccount.yaml
wrote /tmp/npd/node-problem-detector/templates/custom-config-configmap.yaml
wrote /tmp/npd/node-problem-detector/templates/clusterrole.yaml
wrote /tmp/npd/node-problem-detector/templates/clusterrolebinding.yaml
wrote /tmp/npd/node-problem-detector/templates/service.yaml
wrote /tmp/npd/node-problem-detector/templates/daemonset.yaml
wrote /tmp/npd/node-problem-detector/templates/prometheusrule.yaml
wrote /tmp/npd/node-problem-detector/templates/servicemonitor.yaml
           
The rest of the deployment process is omitted...

Testing

  • NodeConditions
There are now many more NodeConditions; by default, k8s only has Ready, DiskPressure, MemoryPressure, PIDPressure, and NetworkUnavailable:
# kubectl get node node1 --output=custom-columns='TYPE:.status.conditions[*].type'
TYPE
KernelDeadlock,ContainerRuntimeUnhealthy,FrequentDockerRestart,FrequentContainerdRestart,KubeletUnhealthy,FrequentUnregisterNetDevice,CorruptDockerOverlay2,ReadonlyFilesystem,FrequentKubeletRestart,NetworkUnavailable,MemoryPressure,DiskPressure,PIDPressure,Ready
           
  • Events
Inject test kernel messages and watch them arrive as Events:
# echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg
# echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg
# kubectl get events -w
.....
0s          Warning   KernelOops               node/node1                           kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING
0s          Warning   TaskHung                 node/node1                           kernel: INFO: task docker:20744 blocked for more than 120 seconds.
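These two messages are picked up by the SystemLogMonitor's kernel rules. The patterns below paraphrase the ones shipped in NPD's default kernel-monitor.json; treat the exact regexes as assumptions rather than the verbatim config.

```python
import re

# Paraphrased versions of NPD kernel-monitor.json rules (assumed, not verbatim).
rules = {
    "KernelOops": r"BUG: unable to handle kernel NULL pointer dereference at .*",
    "TaskHung": r"INFO: task \S+ blocked for more than \d+ seconds\.",
}
logs = [
    "kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING",
    "kernel: INFO: task docker:20744 blocked for more than 120 seconds.",
]
for line in logs:
    for reason, pattern in rules.items():
        if re.search(pattern, line):
            print(f"{reason}: {line}")
```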
           
  • Endpoints
### 10.233.71.57: the IP of an arbitrarily chosen npd pod
### 20256: the port of the npd process itself (conditions endpoint)
### 20257: the metrics port
# curl -s 10.233.71.57:20256/conditions | jq
[
  {
    "type": "FrequentDockerRestart",
    "status": "False",
    "transition": "2022-10-21T13:57:07.838088391+08:00",
    "reason": "NoFrequentDockerRestart",
    "message": "docker is functioning properly"
  },
  {
    "type": "CorruptDockerOverlay2",
    "status": "False",
    "transition": "2022-10-21T13:57:07.857799203+08:00",
    "reason": "NoCorruptDockerOverlay2",
    "message": "docker overlay2 is functioning properly"
  },
  {
    "type": "ReadonlyFilesystem",
    "status": "False",
    "transition": "2022-10-21T13:57:07.838162889+08:00",
    "reason": "FilesystemIsNotReadOnly",
    "message": "Filesystem is not read-only"
  },
  {
    "type": "FrequentUnregisterNetDevice",
    "status": "False",
    "transition": "2022-10-21T13:57:07.838623439+08:00",
    "reason": "NoFrequentUnregisterNetDevice",
    "message": "node is functioning properly"
  },
  {
    "type": "KubeletUnhealthy",
    "status": "True",
    "transition": "2022-10-21T13:57:07.858331269+08:00",
    "reason": "KubeletUnhealthy",
    "message": "kubelet:kubelet was found unhealthy; repair flag : true"
  },
  {
    "type": "FrequentKubeletRestart",
    "status": "False",
    "transition": "2022-10-21T13:57:07.83808829+08:00",
    "reason": "NoFrequentKubeletRestart",
    "message": "kubelet is functioning properly"
  },
  {
    "type": "ContainerRuntimeUnhealthy",
    "status": "True",
    "transition": "2022-10-21T13:57:07.846599142+08:00",
    "reason": "DockerUnhealthy",
    "message": "docker:docker was found unhealthy; repair flag : true"
  },
  {
    "type": "FrequentContainerdRestart",
    "status": "False",
    "transition": "2022-10-21T13:57:07.838088457+08:00",
    "reason": "NoFrequentContainerdRestart",
    "message": "containerd is functioning properly"
  },
  {
    "type": "KernelDeadlock",
    "status": "False",
    "transition": "2022-10-21T13:57:07.838162818+08:00",
    "reason": "KernelHasNoDeadlock",
    "message": "kernel has no deadlock"
  }
]
# curl -s 10.233.71.57:20257/metrics
# HELP cpu_load_15m CPU average load (15m)
# TYPE cpu_load_15m gauge
cpu_load_15m 1.25
# HELP cpu_load_1m CPU average load (1m)
# TYPE cpu_load_1m gauge
cpu_load_1m 2.48
# HELP cpu_load_5m CPU average load (5m)
# TYPE cpu_load_5m gauge
cpu_load_5m 1.63
# HELP cpu_runnable_task_count The average number of runnable tasks in the run-queue during the last minute
# TYPE cpu_runnable_task_count gauge
cpu_runnable_task_count 2.48
# HELP cpu_usage_time CPU usage, in seconds
# TYPE cpu_usage_time counter
cpu_usage_time{state="guest"} 0
cpu_usage_time{state="guest_nice"} 0
cpu_usage_time{state="idle"} 3.200193903e+09
cpu_usage_time{state="iowait"} 6.9558739e+07
cpu_usage_time{state="irq"} 0
cpu_usage_time{state="nice"} 59504
cpu_usage_time{state="softirq"} 7.250425999999999e+06
cpu_usage_time{state="steal"} 0
cpu_usage_time{state="system"} 4.2156504e+07
cpu_usage_time{state="user"} 1.01992299e+08
# HELP disk_avg_queue_len The average queue length on the disk
# TYPE disk_avg_queue_len gauge
disk_avg_queue_len{device_name="loop0"} 0
disk_avg_queue_len{device_name="loop1"} 0
disk_avg_queue_len{device_name="loop2"} 0
disk_avg_queue_len{device_name="loop4"} 0
disk_avg_queue_len{device_name="loop5"} 0
disk_avg_queue_len{device_name="loop6"} 0
disk_avg_queue_len{device_name="loop7"} 0
disk_avg_queue_len{device_name="loop8"} 0
disk_avg_queue_len{device_name="loop9"} 0
disk_avg_queue_len{device_name="sda"} 0.1492731399187702
disk_avg_queue_len{device_name="sdb"} 0.015667346083479677
disk_avg_queue_len{device_name="sdc"} 0.0010667129248326588
disk_avg_queue_len{device_name="sr0"} 0
# HELP disk_bytes_used Disk bytes used, in Bytes
# TYPE disk_bytes_used gauge
disk_bytes_used{device_name="mapper/ubuntu--vg-ubuntu--lv",fs_type="ext4",mount_option="ro,relatime",state="free"} 9.8351378432e+10
disk_bytes_used{device_name="mapper/ubuntu--vg-ubuntu--lv",fs_type="ext4",mount_option="ro,relatime",state="used"} 2.8434952192e+10
# HELP disk_io_time The IO time spent on the disk, in ms
# TYPE disk_io_time counter
disk_io_time{device_name="loop0"} 3700
disk_io_time{device_name="loop1"} 4136
disk_io_time{device_name="loop2"} 8116
disk_io_time{device_name="loop4"} 3300
disk_io_time{device_name="loop5"} 4044
disk_io_time{device_name="loop6"} 8956
disk_io_time{device_name="loop7"} 16448
disk_io_time{device_name="loop8"} 1832
disk_io_time{device_name="loop9"} 1884
disk_io_time{device_name="sda"} 3.61836016e+08
disk_io_time{device_name="sdb"} 3.83064588e+08
disk_io_time{device_name="sdc"} 3.0174824e+08
disk_io_time{device_name="sr0"} 1396
# HELP disk_merged_operation_count Disk merged operations count
# TYPE disk_merged_operation_count counter
disk_merged_operation_count{device_name="loop0",direction="read"} 0
disk_merged_operation_count{device_name="loop0",direction="write"} 0
disk_merged_operation_count{device_name="loop1",direction="read"} 0
disk_merged_operation_count{device_name="loop1",direction="write"} 0
disk_merged_operation_count{device_name="loop2",direction="read"} 0
disk_merged_operation_count{device_name="loop2",direction="write"} 0
disk_merged_operation_count{device_name="loop4",direction="read"} 0
disk_merged_operation_count{device_name="loop4",direction="write"} 0
disk_merged_operation_count{device_name="loop5",direction="read"} 0
disk_merged_operation_count{device_name="loop5",direction="write"} 0
disk_merged_operation_count{device_name="loop6",direction="read"} 0
disk_merged_operation_count{device_name="loop6",direction="write"} 0
disk_merged_operation_count{device_name="loop7",direction="read"} 0
disk_merged_operation_count{device_name="loop7",direction="write"} 0
disk_merged_operation_count{device_name="loop8",direction="read"} 0
disk_merged_operation_count{device_name="loop8",direction="write"} 0
disk_merged_operation_count{device_name="loop9",direction="read"} 0
disk_merged_operation_count{device_name="loop9",direction="write"} 0
disk_merged_operation_count{device_name="sda",direction="read"} 50851
disk_merged_operation_count{device_name="sda",direction="write"} 7.281682e+07
disk_merged_operation_count{device_name="sdb",direction="read"} 1.2025967e+07
disk_merged_operation_count{device_name="sdb",direction="write"} 5.13174017e+08
disk_merged_operation_count{device_name="sdc",direction="read"} 1
disk_merged_operation_count{device_name="sdc",direction="write"} 2.0170223e+07
disk_merged_operation_count{device_name="sr0",direction="read"} 0
disk_merged_operation_count{device_name="sr0",direction="write"} 0
# HELP disk_operation_bytes_count Bytes transferred in disk operations
# TYPE disk_operation_bytes_count counter
disk_operation_bytes_count{device_name="loop0",direction="read"} 3.5198976e+07
disk_operation_bytes_count{device_name="loop0",direction="write"} 0
disk_operation_bytes_count{device_name="loop1",direction="read"} 3.7245952e+07
disk_operation_bytes_count{device_name="loop1",direction="write"} 0
disk_operation_bytes_count{device_name="loop2",direction="read"} 3.8960128e+07
disk_operation_bytes_count{device_name="loop2",direction="write"} 0
disk_operation_bytes_count{device_name="loop4",direction="read"} 3.8607872e+07
disk_operation_bytes_count{device_name="loop4",direction="write"} 0
disk_operation_bytes_count{device_name="loop5",direction="read"} 3.7970944e+07
disk_operation_bytes_count{device_name="loop5",direction="write"} 0
disk_operation_bytes_count{device_name="loop6",direction="read"} 4.9266688e+07
disk_operation_bytes_count{device_name="loop6",direction="write"} 0
disk_operation_bytes_count{device_name="loop7",direction="read"} 5.8686464e+07
disk_operation_bytes_count{device_name="loop7",direction="write"} 0
disk_operation_bytes_count{device_name="loop8",direction="read"} 3.282432e+07
disk_operation_bytes_count{device_name="loop8",direction="write"} 0
disk_operation_bytes_count{device_name="loop9",direction="read"} 2.1265408e+07
disk_operation_bytes_count{device_name="loop9",direction="write"} 0
disk_operation_bytes_count{device_name="sda",direction="read"} 1.303042048e+10
disk_operation_bytes_count{device_name="sda",direction="write"} 2.436338679808e+12
disk_operation_bytes_count{device_name="sdb",direction="read"} 1.444088193024e+12
disk_operation_bytes_count{device_name="sdb",direction="write"} 4.559270782464e+12
disk_operation_bytes_count{device_name="sdc",direction="read"} 6.7003392e+07
disk_operation_bytes_count{device_name="sdc",direction="write"} 7.94242318336e+11
disk_operation_bytes_count{device_name="sr0",direction="read"} 3.8854656e+07
disk_operation_bytes_count{device_name="sr0",direction="write"} 0
# HELP disk_operation_count Disk operations count
# TYPE disk_operation_count counter
disk_operation_count{device_name="loop0",direction="read"} 437
disk_operation_count{device_name="loop0",direction="write"} 0
disk_operation_count{device_name="loop1",direction="read"} 444
disk_operation_count{device_name="loop1",direction="write"} 0
disk_operation_count{device_name="loop2",direction="read"} 1861
disk_operation_count{device_name="loop2",direction="write"} 0
disk_operation_count{device_name="loop4",direction="read"} 406
disk_operation_count{device_name="loop4",direction="write"} 0
disk_operation_count{device_name="loop5",direction="read"} 407
disk_operation_count{device_name="loop5",direction="write"} 0
disk_operation_count{device_name="loop6",direction="read"} 24129
disk_operation_count{device_name="loop6",direction="write"} 0
disk_operation_count{device_name="loop7",direction="read"} 21295
disk_operation_count{device_name="loop7",direction="write"} 0
disk_operation_count{device_name="loop8",direction="read"} 341
disk_operation_count{device_name="loop8",direction="write"} 0
disk_operation_count{device_name="loop9",direction="read"} 19064
disk_operation_count{device_name="loop9",direction="write"} 0
disk_operation_count{device_name="sda",direction="read"} 435885
disk_operation_count{device_name="sda",direction="write"} 7.5885379e+07
disk_operation_count{device_name="sdb",direction="read"} 2.3067801e+07
disk_operation_count{device_name="sdb",direction="write"} 2.58998356e+08
disk_operation_count{device_name="sdc",direction="read"} 5285
disk_operation_count{device_name="sdc",direction="write"} 1.48210393e+08
disk_operation_count{device_name="sr0",direction="read"} 1155
disk_operation_count{device_name="sr0",direction="write"} 0
# HELP disk_operation_time Time spent in disk operations, in ms
# TYPE disk_operation_time counter
disk_operation_time{device_name="loop0",direction="read"} 3056
disk_operation_time{device_name="loop0",direction="write"} 0
disk_operation_time{device_name="loop1",direction="read"} 3332
disk_operation_time{device_name="loop1",direction="write"} 0
disk_operation_time{device_name="loop2",direction="read"} 98393
disk_operation_time{device_name="loop2",direction="write"} 0
disk_operation_time{device_name="loop4",direction="read"} 3270
disk_operation_time{device_name="loop4",direction="write"} 0
disk_operation_time{device_name="loop5",direction="read"} 4736
disk_operation_time{device_name="loop5",direction="write"} 0
disk_operation_time{device_name="loop6",direction="read"} 128687
disk_operation_time{device_name="loop6",direction="write"} 0
disk_operation_time{device_name="loop7",direction="read"} 405324
disk_operation_time{device_name="loop7",direction="write"} 0
disk_operation_time{device_name="loop8",direction="read"} 820
disk_operation_time{device_name="loop8",direction="write"} 0
disk_operation_time{device_name="loop9",direction="read"} 1186
disk_operation_time{device_name="loop9",direction="write"} 0
disk_operation_time{device_name="sda",direction="read"} 3.5851744e+07
disk_operation_time{device_name="sda",direction="write"} 2.135674009e+09
disk_operation_time{device_name="sdb",direction="read"} 7.54775323e+08
disk_operation_time{device_name="sdb",direction="write"} 1.77503746e+08
disk_operation_time{device_name="sdc",direction="read"} 297934
disk_operation_time{device_name="sdc",direction="write"} 2.12061369e+08
disk_operation_time{device_name="sr0",direction="read"} 185
disk_operation_time{device_name="sr0",direction="write"} 0
# HELP disk_weighted_io The weighted IO on the disk, in ms
# TYPE disk_weighted_io counter
disk_weighted_io{device_name="loop0"} 2924
disk_weighted_io{device_name="loop1"} 3192
disk_weighted_io{device_name="loop2"} 96536
disk_weighted_io{device_name="loop4"} 3124
disk_weighted_io{device_name="loop5"} 4588
disk_weighted_io{device_name="loop6"} 124600
disk_weighted_io{device_name="loop7"} 396628
disk_weighted_io{device_name="loop8"} 784
disk_weighted_io{device_name="loop9"} 0
disk_weighted_io{device_name="sda"} 2.148574924e+09
disk_weighted_io{device_name="sdb"} 8.45756244e+08
disk_weighted_io{device_name="sdc"} 1.81354428e+08
disk_weighted_io{device_name="sr0"} 24
# HELP host_uptime The uptime of the operating system
# TYPE host_uptime gauge
host_uptime{kernel_version="5.4.0-125-generic",os_version="debian 10 (buster)"} 4.318465e+06
# HELP memory_anonymous_used Anonymous memory usage, in Bytes. Summing values of all states yields the total anonymous memory used.
# TYPE memory_anonymous_used gauge
memory_anonymous_used{state="active"} 5.126115328e+09
memory_anonymous_used{state="inactive"} 884736
# HELP memory_bytes_used Memory usage by each memory state, in Bytes. Summing values of all states yields the total memory on the node.
# TYPE memory_bytes_used gauge
memory_bytes_used{state="buffered"} 1.913339904e+09
memory_bytes_used{state="cached"} 7.10950912e+09
memory_bytes_used{state="free"} 4.56556544e+08
memory_bytes_used{state="slab"} 9.78538496e+08
memory_bytes_used{state="used"} 6.32797184e+09
# HELP memory_dirty_used Dirty pages usage, in Bytes. Dirty means the memory is waiting to be written back to disk, and writeback means the memory is actively being written back to disk.
# TYPE memory_dirty_used gauge
memory_dirty_used{state="dirty"} 217088
memory_dirty_used{state="writeback"} 4096
# HELP memory_page_cache_used Page cache memory usage, in Bytes. Summing values of all states yields the total anonymous memory used.
# TYPE memory_page_cache_used gauge
memory_page_cache_used{state="active"} 4.081680384e+09
memory_page_cache_used{state="inactive"} 5.916962816e+09
# HELP memory_unevictable_used Unevictable memory usage, in Bytes
# TYPE memory_unevictable_used gauge
memory_unevictable_used 1.9603456e+07
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="CCPPCrash"} 0
problem_counter{reason="CorruptDockerImage"} 0
problem_counter{reason="CorruptDockerOverlay2"} 0
problem_counter{reason="DockerContainerStartupFailure"} 0
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerUnhealthy"} 1
problem_counter{reason="Ext4Error"} 0
problem_counter{reason="Ext4Warning"} 0
problem_counter{reason="FilesystemIsReadOnly"} 0
problem_counter{reason="FrequentContainerdRestart"} 0
problem_counter{reason="FrequentDockerRestart"} 0
problem_counter{reason="FrequentKubeletRestart"} 0
problem_counter{reason="IOError"} 0
problem_counter{reason="KernelOops"} 0
problem_counter{reason="Kerneloops"} 0
problem_counter{reason="KubeletUnhealthy"} 1
problem_counter{reason="MemoryReadError"} 0
problem_counter{reason="OOMKilling"} 0
problem_counter{reason="TaskHung"} 0
problem_counter{reason="UncaughtException"} 0
problem_counter{reason="UnregisterNetDevice"} 0
problem_counter{reason="VMcore"} 0
problem_counter{reason="XorgCrash"} 0
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CorruptDockerOverlay2",type="CorruptDockerOverlay2"} 0
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
problem_gauge{reason="DockerUnhealthy",type="ContainerRuntimeUnhealthy"} 1
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
problem_gauge{reason="FrequentContainerdRestart",type="FrequentContainerdRestart"} 0
problem_gauge{reason="FrequentDockerRestart",type="FrequentDockerRestart"} 0
problem_gauge{reason="FrequentKubeletRestart",type="FrequentKubeletRestart"} 0
problem_gauge{reason="KubeletUnhealthy",type="KubeletUnhealthy"} 1
problem_gauge{reason="NoFrequentContainerdRestart",type="FrequentContainerdRestart"} 0
problem_gauge{reason="NoFrequentDockerRestart",type="FrequentDockerRestart"} 0
problem_gauge{reason="NoFrequentKubeletRestart",type="FrequentKubeletRestart"} 0
problem_gauge{reason="NoFrequentUnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
# HELP system_cpu_stat Cumulative time each cpu spent in various stages.
# TYPE system_cpu_stat counter
system_cpu_stat{cpu="cpu0",stage="guest"} 0
system_cpu_stat{cpu="cpu0",stage="guestNice"} 0
system_cpu_stat{cpu="cpu0",stage="iRQ"} 0
system_cpu_stat{cpu="cpu0",stage="idle"} 2.803354758e+07
system_cpu_stat{cpu="cpu0",stage="iowait"} 602734.6900000001
system_cpu_stat{cpu="cpu0",stage="nice"} 456.33
system_cpu_stat{cpu="cpu0",stage="softIRQ"} 88822.09999999999
system_cpu_stat{cpu="cpu0",stage="steal"} 0
system_cpu_stat{cpu="cpu0",stage="system"} 371243.26
system_cpu_stat{cpu="cpu0",stage="user"} 893459.82
system_cpu_stat{cpu="cpu1",stage="guest"} 0
system_cpu_stat{cpu="cpu1",stage="guestNice"} 0
system_cpu_stat{cpu="cpu1",stage="iRQ"} 0
system_cpu_stat{cpu="cpu1",stage="idle"} 2.795526287e+07
system_cpu_stat{cpu="cpu1",stage="iowait"} 627008.6000000001
system_cpu_stat{cpu="cpu1",stage="nice"} 574.2099999999999
system_cpu_stat{cpu="cpu1",stage="softIRQ"} 84402.84
system_cpu_stat{cpu="cpu1",stage="steal"} 0
system_cpu_stat{cpu="cpu1",stage="system"} 367280.31999999995
system_cpu_stat{cpu="cpu1",stage="user"} 890996.12
system_cpu_stat{cpu="cpu2",stage="guest"} 0
system_cpu_stat{cpu="cpu2",stage="guestNice"} 0
system_cpu_stat{cpu="cpu2",stage="iRQ"} 0
system_cpu_stat{cpu="cpu2",stage="idle"} 2.80118165e+07
system_cpu_stat{cpu="cpu2",stage="iowait"} 601355.44
system_cpu_stat{cpu="cpu2",stage="nice"} 554.61
system_cpu_stat{cpu="cpu2",stage="softIRQ"} 60500.49
system_cpu_stat{cpu="cpu2",stage="steal"} 0
system_cpu_stat{cpu="cpu2",stage="system"} 369550.51999999996
system_cpu_stat{cpu="cpu2",stage="user"} 895233.5700000001
system_cpu_stat{cpu="cpu3",stage="guest"} 0
system_cpu_stat{cpu="cpu3",stage="guestNice"} 0
system_cpu_stat{cpu="cpu3",stage="iRQ"} 0
system_cpu_stat{cpu="cpu3",stage="idle"} 2.8032863539999995e+07
system_cpu_stat{cpu="cpu3",stage="iowait"} 598279.64
system_cpu_stat{cpu="cpu3",stage="nice"} 484.19000000000005
system_cpu_stat{cpu="cpu3",stage="softIRQ"} 46423.88
system_cpu_stat{cpu="cpu3",stage="steal"} 0
system_cpu_stat{cpu="cpu3",stage="system"} 370835.82999999996
system_cpu_stat{cpu="cpu3",stage="user"} 894153.73
system_cpu_stat{cpu="cpu4",stage="guest"} 0
system_cpu_stat{cpu="cpu4",stage="guestNice"} 0
system_cpu_stat{cpu="cpu4",stage="iRQ"} 0
system_cpu_stat{cpu="cpu4",stage="idle"} 2.7875754340000004e+07
system_cpu_stat{cpu="cpu4",stage="iowait"} 640530.93
system_cpu_stat{cpu="cpu4",stage="nice"} 446.6
system_cpu_stat{cpu="cpu4",stage="softIRQ"} 97673.76999999999
system_cpu_stat{cpu="cpu4",stage="steal"} 0
system_cpu_stat{cpu="cpu4",stage="system"} 361362.44
system_cpu_stat{cpu="cpu4",stage="user"} 884454.93
system_cpu_stat{cpu="cpu5",stage="guest"} 0
system_cpu_stat{cpu="cpu5",stage="guestNice"} 0
system_cpu_stat{cpu="cpu5",stage="iRQ"} 0
system_cpu_stat{cpu="cpu5",stage="idle"} 2.802950592e+07
system_cpu_stat{cpu="cpu5",stage="iowait"} 605046.56
system_cpu_stat{cpu="cpu5",stage="nice"} 518
system_cpu_stat{cpu="cpu5",stage="softIRQ"} 44329.9
system_cpu_stat{cpu="cpu5",stage="steal"} 0
system_cpu_stat{cpu="cpu5",stage="system"} 370232.26
system_cpu_stat{cpu="cpu5",stage="user"} 892967.6399999999
system_cpu_stat{cpu="cpu6",stage="guest"} 0
system_cpu_stat{cpu="cpu6",stage="guestNice"} 0
system_cpu_stat{cpu="cpu6",stage="iRQ"} 0
system_cpu_stat{cpu="cpu6",stage="idle"} 2.803386166e+07
system_cpu_stat{cpu="cpu6",stage="iowait"} 591032.88
system_cpu_stat{cpu="cpu6",stage="nice"} 532.14
system_cpu_stat{cpu="cpu6",stage="softIRQ"} 41689.9
system_cpu_stat{cpu="cpu6",stage="steal"} 0
system_cpu_stat{cpu="cpu6",stage="system"} 370601.5399999999
system_cpu_stat{cpu="cpu6",stage="user"} 893875
system_cpu_stat{cpu="cpu7",stage="guest"} 0
system_cpu_stat{cpu="cpu7",stage="guestNice"} 0
system_cpu_stat{cpu="cpu7",stage="iRQ"} 0
system_cpu_stat{cpu="cpu7",stage="idle"} 2.803252314e+07
system_cpu_stat{cpu="cpu7",stage="iowait"} 602404.69
system_cpu_stat{cpu="cpu7",stage="nice"} 598.8499999999999
system_cpu_stat{cpu="cpu7",stage="softIRQ"} 43641.26
system_cpu_stat{cpu="cpu7",stage="steal"} 0
system_cpu_stat{cpu="cpu7",stage="system"} 369628.15
system_cpu_stat{cpu="cpu7",stage="user"} 893830.4199999999
# HELP system_interrupts_total Total number of interrupts serviced (cumulative).
# TYPE system_interrupts_total counter
system_interrupts_total 3.07863006363e+11
# HELP system_os_feature OS Features like GPU support, KTD kernel, third party modules as unknown modules. 1 if the feature is enabled and 0, if disabled.
# TYPE system_os_feature gauge
system_os_feature{os_feature="GPUSupport",value=""} 0
system_os_feature{os_feature="KTD",value=""} 0
system_os_feature{os_feature="KernelModuleIntegrity",value=""} 0
system_os_feature{os_feature="UnifiedCgroupHierarchy",value=""} 0
system_os_feature{os_feature="UnknownModules",value=""} 0
# HELP system_processes_total Number of forks since boot.
# TYPE system_processes_total counter
system_processes_total 7.54488744e+08
# HELP system_procs_blocked Number of processes currently blocked.
# TYPE system_procs_blocked gauge
system_procs_blocked 1
# HELP system_procs_running Number of processes currently running.
# TYPE system_procs_running gauge
system_procs_running 2
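As a quick sanity check on scraped output like the above, the currently active problems can be extracted by selecting `problem_gauge` samples whose value is 1. A minimal Python sketch, with sample lines copied from the output above:

```python
# Find active node problems in Prometheus-format text by selecting
# problem_gauge samples whose value is 1.
metrics_text = """\
problem_gauge{reason="DockerUnhealthy",type="ContainerRuntimeUnhealthy"} 1
problem_gauge{reason="FrequentKubeletRestart",type="FrequentKubeletRestart"} 0
problem_gauge{reason="KubeletUnhealthy",type="KubeletUnhealthy"} 1
"""

def active_problems(text):
    problems = []
    for line in text.splitlines():
        if line.startswith("problem_gauge") and line.rstrip().endswith(" 1"):
            # Keep just the label set of the active problem.
            problems.append(line.split("{", 1)[1].split("}", 1)[0])
    return problems

print(active_problems(metrics_text))
```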

           

Thanks

Node Problem Detector

NPD HELM

Google Node Problem Detector

Node Condition