Introduction
First, read the article "Kubernetes Monitoring Handbook 01: System Overview" to review the Kubernetes architecture. Kube-Proxy runs on every worker node.
Kube-Proxy exposes two ports by default. Port 10249 exposes monitoring metrics: its /metrics endpoint serves them in the Prometheus format:
[[email protected] lib]# curl -s http://localhost:10249/metrics | head -n 10
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.5307e-05
go_gc_duration_seconds{quantile="0.25"} 2.8884e-05
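The output above is in the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines, followed by samples of the form `name{labels} value`. To make that structure concrete, here is a minimal Python sketch that parses such lines. `parse_samples` is a hypothetical helper of mine, not part of Categraf or Prometheus, and it deliberately ignores edge cases such as escaped quotes in label values and optional timestamps.

```python
import re

# metric name, optional {label="value",...} block, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels_dict, float_value) tuples.

    Simplified on purpose: comment lines are skipped, timestamps and
    escaped characters inside label values are not handled.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# A few lines copied from the curl output above.
EXAMPLE = """\
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.5307e-05
go_gc_duration_seconds{quantile="0.25"} 2.8884e-05
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

In practice you would let Categraf or a Prometheus client library do this parsing; the sketch is only meant to show what the collector has to read.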
Port 10256 is the health-check port. Its /healthz endpoint returns two timestamps:
[[email protected] lib]# curl -s http://localhost:10256/healthz | jq .
{
"lastUpdated": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250",
"currentTime": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250"
}
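If you want to script against this response, one possible use is to compare the two timestamps. The sketch below is an assumption on my part, not something Kube-Proxy or Categraf ships: `parse_go_time` and `staleness_seconds` are hypothetical helpers, and treating a large gap between `currentTime` and `lastUpdated` as a warning sign is my own reading of the payload.

```python
from datetime import datetime


def parse_go_time(value):
    """Parse a value such as
    '2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250'.

    Only the date, time and UTC offset are used; the zone abbreviation and
    the trailing monotonic-clock reading (m=+...) are ignored.
    """
    date_part, time_part, offset = value.split()[:3]
    if "." in time_part:
        whole, frac = time_part.split(".")
        # strptime's %f accepts at most 6 digits, so trim nanoseconds.
        time_part = f"{whole}.{frac[:6].ljust(6, '0')}"
    else:
        time_part += ".000000"
    return datetime.strptime(
        f"{date_part} {time_part} {offset}", "%Y-%m-%d %H:%M:%S.%f %z"
    )


def staleness_seconds(healthz):
    """Seconds between currentTime and lastUpdated in a /healthz payload."""
    last = parse_go_time(healthz["lastUpdated"])
    now = parse_go_time(healthz["currentTime"])
    return (now - last).total_seconds()


# The payload shown above, as a Python dict.
payload = {
    "lastUpdated": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250",
    "currentTime": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250",
}
print(staleness_seconds(payload))
```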
So all we need to do is scrape the metrics from http://localhost:10249/metrics. Since the data is in the Prometheus format, Categraf's input.prometheus plugin can handle it.
The Categraf prometheus plugin
The configuration file is conf/input.prometheus/prometheus.toml. Just add the Kube-Proxy address to it:
interval = 15
[[instances]]
urls = [
"http://localhost:10249/metrics"
]
labels = { job="kube-proxy" }
The urls field lists the endpoints to scrape, that is, every URL that serves metrics data. Let's test it with the following command:
[[email protected] categraf]$ ./categraf --test --inputs prometheus | grep kubeproxy_sync_proxy_rules
2022/11/09 13:30:17 main.go:110: I! runner.binarydir: /home/work/go/src/categraf
2022/11/09 13:30:17 main.go:111: I! runner.hostname: tt-fc-dev01.nj
2022/11/09 13:30:17 main.go:112: I! runner.fd_limits: (soft=655360, hard=655360)
2022/11/09 13:30:17 main.go:113: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2022/11/09 13:30:17 config.go:33: I! tracing disabled
2022/11/09 13:30:17 provider.go:63: I! use input provider: [local]
2022/11/09 13:30:17 agent.go:87: I! agent starting
2022/11/09 13:30:17 metrics_agent.go:93: I! input: local.prometheus started
2022/11/09 13:30:17 prometheus_scrape.go:14: I! prometheus scraping disabled!
2022/11/09 13:30:17 agent.go:98: I! agent started
13:30:17 kubeproxy_sync_proxy_rules_endpoint_changes_pending agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_count agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 319786
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_sum agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 17652.749911909214
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=+Inf 319786
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.001 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.002 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.004 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.008 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.016 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.032 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.064 274815
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.128 316616
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.256 319525
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.512 319776
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=1.024 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=2.048 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=4.096 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=8.192 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=16.384 319786
13:30:17 kubeproxy_sync_proxy_rules_service_changes_pending agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 1.6668536394083393e+09
13:30:17 kubeproxy_sync_proxy_rules_iptables_restore_failures_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_endpoint_changes_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 219139
13:30:17 kubeproxy_sync_proxy_rules_last_timestamp_seconds agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 1.6679718066295934e+09
13:30:17 kubeproxy_sync_proxy_rules_service_changes_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 512372
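In the Kubernetes architecture, Kube-Proxy syncs rules from the APIServer and then updates the iptables or ipvs configuration, so the metrics about rule syncing are the critical ones. That is why I grepped for them as the sample here.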
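To get a feel for what the histogram in that output says, here is a small Python sketch that computes the average sync duration and estimates quantiles from the bucket counts. The numbers are copied from the --test output above; `mean_duration` and `estimate_quantile` are hypothetical helpers, and the linear interpolation inside a bucket is the same simplification that PromQL's histogram_quantile makes. Note that these counters are cumulative since Kube-Proxy started, so the result describes the whole process lifetime rather than a recent window.

```python
import math

# Cumulative bucket counts from the --test output above: (le, count).
BUCKETS = [
    (0.001, 0), (0.002, 0), (0.004, 0), (0.008, 0), (0.016, 0), (0.032, 0),
    (0.064, 274815), (0.128, 316616), (0.256, 319525), (0.512, 319776),
    (1.024, 319784), (2.048, 319784), (4.096, 319784), (8.192, 319784),
    (16.384, 319786), (math.inf, 319786),
]
SUM_SECONDS = 17652.749911909214  # kubeproxy_sync_proxy_rules_duration_seconds_sum
COUNT = 319786                    # kubeproxy_sync_proxy_rules_duration_seconds_count


def mean_duration(total_seconds, count):
    """Average sync duration over the process lifetime."""
    return total_seconds / count


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    Assumes observations are spread evenly inside each bucket.
    """
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate over: use the bucket's lower edge.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


print("mean:", mean_duration(SUM_SECONDS, COUNT))
print("p50 :", estimate_quantile(0.50, BUCKETS))
print("p99 :", estimate_quantile(0.99, BUCKETS))
```

On this sample the average works out to about 0.055 seconds, and both the median and the 99th percentile fall at or below the 0.128-second bucket, which suggests rule syncing is healthy on this node.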
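If --test prints output like this, the data is being collected correctly. You then update the Categraf configuration on each of your worker nodes. This approach is straightforward but a little tedious: whenever you add a new Node, you also have to update its Categraf collection configuration and add the Kube-Proxy /metrics address. That is manageable if you roll it out with scripts, but tedious if you deploy by hand. Instead, we can run the Categraf collector as a DaemonSet, so scaling out is no longer a concern: a DaemonSet is scheduled automatically onto all Nodes.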
Deploying Categraf as a DaemonSet
To run Categraf as a DaemonSet, first create a namespace; the related ConfigMaps, the DaemonSet, and so on all belong to it. If you only monitor Kube-Proxy, Categraf needs just two configuration files: the main config.toml and prometheus.toml. Let's walk through it.
Create the namespace
[[email protected] categraf]$ kubectl create namespace flashcat
namespace/flashcat created
[[email protected] categraf]$ kubectl get ns | grep flashcat
flashcat Active 29s
Create the ConfigMaps
The ConfigMaps hold the contents of config.toml and prometheus.toml. I have prepared the yaml file for you; save it as categraf-configmap-v1.yaml:
---
kind: ConfigMap
metadata:
  name: categraf-config
apiVersion: v1
data:
  config.toml: |
    [global]
    hostname = "$HOSTNAME"
    interval = 15
    providers = ["local"]
    [writer_opt]
    batch = 2000
    chan_size = 10000
    [[writers]]
    url = "http://10.206.0.16:19000/prometheus/v1/write"
    timeout = 5000
    dial_timeout = 2500
    max_idle_conns_per_host = 100
---
kind: ConfigMap
metadata:
  name: categraf-input-prometheus
apiVersion: v1
data:
  prometheus.toml: |
    [[instances]]
    urls = ["http://127.0.0.1:10249/metrics"]
    labels = { job="kube-proxy" }
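The address 10.206.0.16:19000 above is only an example; change it to the address of your own n9e-server. If you would rather not push the monitoring data to Nightingale, that is fine too: any other time-series database that offers a remote write endpoint will work.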
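The setting hostname = "$HOSTNAME" uses the $ symbol. When we create the DaemonSet later, the HOSTNAME environment variable is injected into the container so that Categraf picks it up automatically.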
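To illustrate the idea of environment-variable substitution (not Categraf's actual implementation, which I am not reproducing here), this tiny Python sketch expands `$HOSTNAME` in a config line. `expand_config_line` is a hypothetical helper, and the node name is just one of the example node names from this cluster.

```python
from string import Template


def expand_config_line(raw, environ):
    """Replace $NAME placeholders with values from environ.

    Unknown placeholders are left untouched instead of raising an error.
    """
    return Template(raw).safe_substitute(environ)


# The DaemonSet would inject HOSTNAME from spec.nodeName; here it is hard-coded.
print(expand_config_line('hostname = "$HOSTNAME"', {"HOSTNAME": "10.206.0.7"}))
```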
Now let's create the ConfigMaps:
[[email protected] yamls]$ kubectl apply -f categraf-configmap-v1.yaml -n flashcat
configmap/categraf-config created
configmap/categraf-input-prometheus created
[[email protected] yamls]$ kubectl get configmap -n flashcat
NAME DATA AGE
categraf-config 1 19s
categraf-input-prometheus 1 19s
kube-root-ca.crt 1 22m
Create the DaemonSet
With the configuration files ready, we can create the DaemonSet. Remember to inject the HOSTNAME environment variable. The yaml file is below; you can save it as categraf-daemonset-v1.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: categraf-daemonset
  name: categraf-daemonset
spec:
  selector:
    matchLabels:
      app: categraf-daemonset
  template:
    metadata:
      labels:
        app: categraf-daemonset
    spec:
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: flashcatcloud/categraf:v0.2.18
        imagePullPolicy: IfNotPresent
        name: categraf
        volumeMounts:
        - mountPath: /etc/categraf/conf
          name: categraf-config
        - mountPath: /etc/categraf/conf/input.prometheus
          name: categraf-input-prometheus
      hostNetwork: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - configMap:
          name: categraf-config
        name: categraf-config
      - configMap:
          name: categraf-input-prometheus
        name: categraf-input-prometheus
Apply the DaemonSet file:
[[email protected] yamls]$ kubectl apply -f categraf-daemonset-v1.yaml -n flashcat
daemonset.apps/categraf-daemonset created
[[email protected] yamls]$ kubectl get ds -o wide -n flashcat
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
categraf-daemonset 6 6 6 6 6 <none> 2m20s categraf flashcatcloud/categraf:v0.2.17 app=categraf-daemonset
[[email protected] yamls]$ kubectl get pods -o wide -n flashcat
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
categraf-daemonset-4qlt9 1/1 Running 0 2m10s 10.206.0.7 10.206.0.7 <none> <none>
categraf-daemonset-s9bk2 1/1 Running 0 2m10s 10.206.0.11 10.206.0.11 <none> <none>
categraf-daemonset-w77lt 1/1 Running 0 2m10s 10.206.16.3 10.206.16.3 <none> <none>
categraf-daemonset-xgwf5 1/1 Running 0 2m10s 10.206.0.16 10.206.0.16 <none> <none>
categraf-daemonset-z9rk5 1/1 Running 0 2m10s 10.206.16.8 10.206.16.8 <none> <none>
categraf-daemonset-zdp8v 1/1 Running 0 2m10s 10.206.0.17 10.206.0.17 <none> <none>
Everything looks fine. Next, check in Nightingale that the related metrics have arrived.
Metric descriptions
Kong Fei compiled a list of the Kube-Proxy metrics a while ago; I have moved it into this section for your reference:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
GC pause duration
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
Number of goroutines
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
Number of OS threads
# HELP kubeproxy_network_programming_duration_seconds [ALPHA] In Cluster Network Programming Latency in seconds
# TYPE kubeproxy_network_programming_duration_seconds histogram
Time from a Service or Pod change until kube-proxy has finished syncing the rules. The exact meaning of this metric is fairly involved; see https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md
# HELP kubeproxy_sync_proxy_rules_duration_seconds [ALPHA] SyncProxyRules latency in seconds
# TYPE kubeproxy_sync_proxy_rules_duration_seconds histogram
Rule sync duration
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_pending [ALPHA] Pending proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_pending gauge
Number of rule syncs still pending after endpoint changes
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_total [ALPHA] Cumulative proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_total counter
Total number of rule syncs triggered by endpoint changes
# HELP kubeproxy_sync_proxy_rules_iptables_restore_failures_total [ALPHA] Cumulative proxy iptables restore failures
# TYPE kubeproxy_sync_proxy_rules_iptables_restore_failures_total counter
Total number of iptables restore failures on this node
# HELP kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds [ALPHA] The last time a sync of proxy rules was queued
# TYPE kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds gauge
Timestamp at which the most recent rule sync was queued. If it is much larger than the next metric, kubeproxy_sync_proxy_rules_last_timestamp_seconds, the sync is hung.
# HELP kubeproxy_sync_proxy_rules_last_timestamp_seconds [ALPHA] The last time proxy rules were successfully synced
# TYPE kubeproxy_sync_proxy_rules_last_timestamp_seconds gauge
Timestamp at which the most recent rule sync completed
# HELP kubeproxy_sync_proxy_rules_service_changes_pending [ALPHA] Pending proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_pending gauge
Number of rule syncs still pending due to Service changes
# HELP kubeproxy_sync_proxy_rules_service_changes_total [ALPHA] Cumulative proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_total counter
Total number of rule syncs triggered by Service changes
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
Use this metric to derive CPU usage
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
Maximum number of file descriptors the process may open
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
Number of file descriptors the process currently has open
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
Resident memory usage
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
Process start timestamp
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
Latency of requests to the apiserver (broken down by URL and verb)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
Total number of requests to the apiserver (broken down by code, method, and host)
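Two of the descriptions above imply a small calculation: comparing the queued and completed sync timestamps to spot a hung sync, and turning the process_cpu_seconds_total counter into CPU usage. The Python sketch below shows both under my own assumptions: the helper names are hypothetical, the 300-second threshold is an arbitrary example rather than a recommended value, and the CPU samples are made up. The two timestamps are the ones from the --test output earlier in this article.

```python
def sync_lag_seconds(last_queued_ts, last_sync_ts):
    """How long a sync has been queued without a later successful sync."""
    return max(0.0, last_queued_ts - last_sync_ts)


def looks_hung(last_queued_ts, last_sync_ts, threshold_seconds=300.0):
    """True when the queued timestamp is far ahead of the last sync.

    The default threshold is an illustrative assumption; tune it yourself.
    """
    return sync_lag_seconds(last_queued_ts, last_sync_ts) > threshold_seconds


def cpu_cores_used(prev_value, prev_time, cur_value, cur_time):
    """Average CPU cores used between two samples of a CPU-seconds counter."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (cur_value - prev_value) / elapsed


# kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
LAST_QUEUED = 1.6668536394083393e+09
# kubeproxy_sync_proxy_rules_last_timestamp_seconds
LAST_SYNC = 1.6679718066295934e+09

print(looks_hung(LAST_QUEUED, LAST_SYNC))
# Hypothetical samples: 3 CPU-seconds consumed over a 15-second interval.
print(cpu_cores_used(100.0, 0.0, 103.0, 15.0))
```

In a real setup you would express the same logic as alert rules in Nightingale or PromQL rather than in a script; the sketch only spells out the arithmetic.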
Importing the monitoring dashboard
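Related articles
- Kubernetes Monitoring Handbook 01: System Overview
- Kubernetes Monitoring Handbook 02: Host Monitoring Overview
- Kubernetes Monitoring Handbook 03: Host Monitoring in Practice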