
Monitoring Docker Containers with Prometheus


Prometheus monitoring

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community.

Key features of Prometheus

  • Easy to manage

    The Prometheus core is a single binary that runs standalone on the local machine and does not depend on distributed storage.

  • No dependency on distributed storage; each server node is autonomous
  • Efficient

    A single Prometheus server can handle millions of metrics and process

    hundreds of thousands of data points per second.

  • Easy to instrument

    Prometheus provides client SDKs in many languages that make it quick to bring applications under Prometheus monitoring.

  • Targets are discovered via service discovery or static configuration
  • Good visualization

Besides the built-in visualization web UI, the

Grafana

visualization tool also provides full Prometheus support, and you can build your own monitoring UI on top of the API that Prometheus exposes.
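Building on that last point, here is a sketch of what that HTTP API returns. The JSON below is a made-up sample of the shape `/api/v1/query` produces (in a live setup you would fetch it with curl against your server), parsed the way a custom UI might:

```shell
# Made-up sample body from GET /api/v1/query?query=up
# (live equivalent: curl 'http://<host>:9090/api/v1/query?query=up')
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"localhost:9100","job":"prometheus"},"value":[1600000000,"1"]}]}}'
# Extract the sample value; "1" means the target is up
echo "$response" | python3 -c 'import sys, json; d = json.load(sys.stdin); print(d["data"]["result"][0]["value"][1])'
```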

Setting up Prometheus monitoring with Docker

Environment:

  • Firewall stopped and SELinux disabled on all hosts

Host     IP             Installed components
machine  172.16.46.111  NodeExporter, cAdvisor, Prometheus Server, Grafana
node01   172.16.46.112  NodeExporter, cAdvisor
node02   172.16.46.113  NodeExporter, cAdvisor

Prometheus component overview:

Prometheus Server:

The main Prometheus server, port 9090.

NodeExporter:

Collects host hardware and operating-system metrics, port 9100.

cAdvisor:

Collects metrics for the containers running on each host, port 8080.

Grafana:

Renders the Prometheus monitoring dashboards, port 3000.

Alertmanager:

Receives the alerts Prometheus pushes to it and forwards them to the configured recipients, port 9093.

Deploy node-exporter to collect hardware and system metrics

#Install on all 3 hosts
docker run -d -p 9100:9100 -v /proc:/host/proc -v /sys:/host/sys -v /:/rootfs --net=host prom/node-exporter --path.procfs /host/proc --path.sysfs /host/sys --collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"

PS: we use host networking for efficiency. Note that with --net=host Docker ignores the -p 9100:9100 mapping; node-exporter listens on host port 9100 directly.
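Once the exporters are up, a quick spot check is curl against port 9100. The response uses the plain-text exposition format; this sketch parses one made-up sample line the way you might when eyeballing curl output:

```shell
# One made-up line of node-exporter /metrics output
line='node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67'
# The sample value is the last whitespace-separated field
echo "$line" | awk '{print $NF}'
```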

Verify that metrics are being collected.


Deploy cAdvisor to collect container metrics on each node

#Install on all 3 hosts
docker run -v /:/rootfs:ro -v /var/run:/var/run/:rw -v /sys:/sys:ro -v /var/lib/docker:/var/lib/docker:ro -p 8080:8080 --detach=true --name=cadvisor --net=host google/cadvisor

Verify that the container metrics are being collected.

Deploy the prometheus-server service

First start a throwaway prometheus container just to copy out its config file; edit the file, then mount it into the real container.

mkdir /prometheus
docker run -d --name test -P prom/prometheus
docker cp test:/etc/prometheus/prometheus.yml /prometheus
#Edit the prometheus config file: under static_configs, set the targets to
#every IP added here will be monitored
    - targets: ['localhost:9090','localhost:8080','localhost:9100','172.16.46.112:8080','172.16.46.112:9100','172.16.46.113:8080','172.16.46.113:9100']
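For context, that targets line sits under a scrape job in prometheus.yml. Assuming the default file's job name, the surrounding section looks roughly like this (a sketch, not the complete file):

```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090','localhost:8080','localhost:9100',
                  '172.16.46.112:8080','172.16.46.112:9100',
                  '172.16.46.113:8080','172.16.46.113:9100']
```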
           

Re-run the prometheus service

This pulls the cAdvisor and node-exporter metrics into prometheus:

docker rm -f test
docker run -d --name prometheus --net host -p 9090:9090 -v /prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
           

Access test

Open the web UI home page.

It offers some simple graphs, but you have to filter by expression to see anything, and the interface is fairly basic, so next we hook up grafana.

Deploy grafana on the prometheus server

Key grafana concepts

  • Plugins: extend grafana with functionality it does not ship with
  • Data sources: connections to the backends that supply the data the graphs are drawn from
  • Dashboards: the panels that determine what the graphs look like

For using grafana with zabbix, see: https://blog.csdn.net/weixin_43815140/article/details/106109605

Deploy grafana

mkdir /grafana
chmod 777 -R /grafana
docker run -d -p 3000:3000 --name grafana -v /grafana:/var/lib/grafana -e "GF_SECURITY_ADMIN_PASSWORD=123.com" grafana/grafana

Access test

Default credentials:

username: admin

password: 123.com

Log in and add a data source.

Select prometheus and fill in its address.

Click Save & test to connect.

Once the connection succeeds, we still need a dashboard to visualize the data.

The prometheus data source ships with three built-in dashboard options.

Import one and take a look at the result.

This already looks far better than the built-in UI. The grafana website offers many more templates; picking one that fits your environment from the official gallery is not hard.

Two ways to import a template:

  • Download the JSON file locally and import it via upload
  • Enter the template ID and click Load to fetch it automatically

As an example, let's import the template above.

After some minor tweaking, the dashboard appears.

This only monitors host resources, though. To see Docker container metrics, search the grafana site for a Docker-related template and import it; ID 11600 is a good all-round fit.

Configuring Alertmanager alerts

Start AlertManager to receive the alerts Prometheus pushes to it and deliver them through the configured channels.

The alertmanager/prometheus workflow is as follows:

  • prometheus collects the monitored metrics
  • prometheus.yml references rules files, which define the alert conditions
  • prometheus pushes firing alerts to alertmanager, where the senders and recipients are defined
  • alertmanager delivers the alert to a mailbox or to WeChat
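In prometheus.yml this wiring comes down to two sections, sketched here with the Alertmanager address and rules path used later in this article:

```yaml
# Tell Prometheus where Alertmanager lives and which rules files to load
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "/usr/local/prometheus/rules/*.rules"
```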

Alerts also carry a severity level (such as critical, warning, or info), which the rules below set via the severity label.

Start AlertManager with Docker, the same way as prometheus: start a throwaway test container first and copy out alertmanager's config file.

mkdir /alertmanager
docker run -d --name test -p 9093:9093 prom/alertmanager
docker cp test:/etc/alertmanager/alertmanager.yml /alertmanager
cp /alertmanager/alertmanager.yml /alertmanager/alertmanager.yml.bak

AlertManager's default config file is alertmanager.yml; its path inside the container is

/etc/alertmanager/alertmanager.yml

AlertManager listens on port 9093 by default. Once it is up, browse to http://<host-IP>:9093 to see the default UI. There are no alerts yet, because we have not configured any alerting rules to trigger them.


Configure alertmanager email alerts

Look at the alertmanager config file. A quick overview of its main sections:

global: global settings, including the timeout after which a resolved alert is declared, the SMTP settings, and the API addresses of the various notification channels.

route: the alert routing policy. It is a tree structure, matched depth-first from left to right.

receivers: the alert recipients, for example the common email, wechat, slack, and webhook notification channels.

inhibit_rules: inhibition rules; while an alert matching the source set is firing, alerts matching the target set are suppressed.

To configure email alerts, first enable the SMTP service on the mailbox and obtain its authorization code.


Edit alertmanager.yml and fill in the alerting channel details:

global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'  # the sending mailbox
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'iwxrdwmdgofdbbdc'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]' # the receiving mailbox
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

           

Restart the alertmanager container:

docker rm -f test
docker run -d --name alertmanager -p 9093:9093 -v /alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager

Add alertmanager alerting rules to prometheus

Next, we point Prometheus at the AlertManager service address and add alerting rules. Create an alerting rules file

node-up.rules

as follows:

mkdir /prometheus/rules
cd /prometheus/rules
vim node-up.rules
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="prometheus"} == 0
    for: 15s
    labels:
      severity: 1 
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
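Before wiring the rule into Prometheus, it can be unit-tested with promtool. A sketch of a test file (hypothetical name node-up-test.yml, run with `promtool test rules node-up-test.yml`; the expected label set may need adjusting to your series):

```yaml
rule_files:
  - node-up.rules
evaluation_interval: 5s
tests:
  - interval: 5s
    input_series:
      - series: 'up{job="prometheus", instance="172.16.46.112:9100"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 25s
        alertname: node-up
        exp_alerts:
          - exp_labels:
              severity: 1
              team: node
              job: prometheus
              instance: 172.16.46.112:9100
```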
           

Modify prometheus.yml to load the rules, pointing rule_files at the rules directory:

rule_files:
  - "/usr/local/prometheus/rules/*.rules"

PS: this rule_files entry is a path inside the container, so the local node-up.rules file must be mounted to that path. Modify the Prometheus start command as follows and restart the service.
docker rm -f prometheus
docker run -d --name prometheus -p 9090:9090 -v /prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v /prometheus/rules:/usr/local/prometheus/rules --net host prom/prometheus
           

Check that the rule shows up in the prometheus UI.

Trigger an alert email

Stopping one of the monitored services is enough:

[root@host ~]# docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
8d1cc177b58e        google/cadvisor      "/usr/bin/cadvisor -…"   4 hours ago         Up 4 hours                              cadvisor
b2417dbd850f        prom/node-exporter   "/bin/node_exporter …"   4 hours ago         Up 4 hours                              gallant_proskuriakova
[root@host ~]# docker stop 8d1cc177b58e
8d1cc177b58e
           

Check the mailbox for the alert email.

Customizing Alertmanager email alerts

The alerts above already work, but we want the messages to be more readable. Alertmanager supports custom email templates.

First create a template file email.tmpl:

mkdir /alertmanager/template
cd /alertmanager/template
vim email.tmpl
{{ define "email.from" }}[email protected]{{ end }}
{{ define "email.to" }}[email protected]{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert program: prometheus_alert<br>
Severity: {{ .Labels.severity }}<br>
Alert name: {{ .Labels.alertname }}<br>
Host: {{ .Labels.instance }}<br>
Summary: {{ .Annotations.summary }}<br>
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
           

Modify the alertmanager config file to reference the new template.
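The original screenshot is not reproduced here; based on the template names defined above and the container path the run command below mounts them to, the relevant additions to alertmanager.yml look roughly like this (a sketch, adjust paths to your mounts):

```yaml
templates:
  - '/etc/alertmanager-tmpl/*.tmpl'
receivers:
- name: 'email'
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
```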

Recreate the alertmanager container:

[root@host ~]# docker rm -f alertmanager
[root@host ~]# docker run -d --name alertmanager -p 9093:9093 -v /alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /alertmanager/template:/etc/alertmanager-tmpl prom/alertmanager

Test it by triggering an alert again.

PS: the

{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

in the template renders the start time shifted to UTC+8 (Beijing time): .Add takes a duration in nanoseconds, and 28800e9 ns equals 8 hours.
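The arithmetic behind that constant:

```shell
# .Add takes nanoseconds: 28800e9 ns = 28800 s, and 28800 s is 8 hours (UTC+8)
seconds=28800
echo "$((seconds / 3600)) hours"
```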

Extended example: alerting rules for host CPU, memory, disk, and more

recording rules: recording rules let you precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is usually much faster than evaluating the original expression every time it is needed.

alerting rules: alerting rules let you define alert conditions based on Prometheus expression language expressions and send notifications about firing alerts to an external service.

PS: the names of recording and alerting rules must be valid metric names.

Since we already configured rules to be loaded from files matching

/prometheus/rules/*.rules

we can put all of our rules there.

Create two rule files next to node-up.rules: node-exporter-record.rules for the recording rules and node-exporter-alert.rules for the alerting rules.

#rule_files configuration in prometheus.yml
rule_files:
  - "/usr/local/prometheus/rules/*.rules"
           

PS: the job label used in the *.rules files must match the job_name defined in prometheus:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'node-exporter'
           

Add the recording rules:

#node-exporter-record.rules
groups:
  - name: node-exporter-record
    rules:
    - expr: up{job="node-exporter"}
      record: node_exporter:up
      labels:
        desc: "whether the node is up: 1 = up, 0 = down"
        unit: " "
        job: "node-exporter"
    - expr: time() - node_boot_time_seconds{}
      record: node_exporter:node_uptime
      labels:
        desc: "node uptime"
        unit: "s"
        job: "node-exporter"
##############################################################################################
#                              cpu                                                           #
    - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])))  * 100
      record: node_exporter:cpu:total:percent
      labels:
        desc: "total cpu usage percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])))  * 100
      record: node_exporter:cpu:idle:percent
      labels:
        desc: "cpu idle percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="iowait"}[5m])))  * 100
      record: node_exporter:cpu:iowait:percent
      labels:
        desc: "cpu iowait percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="system"}[5m])))  * 100
      record: node_exporter:cpu:system:percent
      labels:
        desc: "cpu system percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="user"}[5m])))  * 100
      record: node_exporter:cpu:user:percent
      labels:
        desc: "cpu user percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode=~"softirq|nice|irq|steal"}[5m])))  * 100
      record: node_exporter:cpu:other:percent
      labels:
        desc: "cpu percentage of the node spent in other modes"
        unit: "%"
        job: "node-exporter"
##############################################################################################
#                                    memory                                                  #
    - expr: node_memory_MemTotal_bytes{job="node-exporter"}
      record: node_exporter:memory:total
      labels:
        desc: "total memory of the node"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemFree_bytes{job="node-exporter"}
      record: node_exporter:memory:free
      labels:
        desc: "free memory of the node"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"}
      record: node_exporter:memory:used
      labels:
        desc: "used memory of the node"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}
      record: node_exporter:memory:actualused
      labels:
        desc: "memory actually in use on the node"
        unit: byte
        job: "node-exporter"

    - expr: (1-(node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
      record: node_exporter:memory:used:percent
      labels:
        desc: "memory usage percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: ((node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
      record: node_exporter:memory:free:percent
      labels:
        desc: "memory free percentage of the node"
        unit: "%"
        job: "node-exporter"
##############################################################################################
#                                   load                                                     #
    - expr: sum by (instance) (node_load1{job="node-exporter"})
      record: node_exporter:load:load1
      labels:
        desc: "1-minute load average"
        unit: " "
        job: "node-exporter"

    - expr: sum by (instance) (node_load5{job="node-exporter"})
      record: node_exporter:load:load5
      labels:
        desc: "5-minute load average"
        unit: " "
        job: "node-exporter"

    - expr: sum by (instance) (node_load15{job="node-exporter"})
      record: node_exporter:load:load15
      labels:
        desc: "15-minute load average"
        unit: " "
        job: "node-exporter"

##############################################################################################
#                                 disk                                                       #
    - expr: node_filesystem_size_bytes{job="node-exporter" ,fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:total
      labels:
        desc: "total disk size of the node"
        unit: byte
        job: "node-exporter"

    - expr: node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:free
      labels:
        desc: "free disk space of the node"
        unit: byte
        job: "node-exporter"

    - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:used
      labels:
        desc: "used disk space of the node"
        unit: byte
        job: "node-exporter"

    - expr:  (1 - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}) * 100
      record: node_exporter:disk:used:percent
      labels:
        desc: "disk usage percentage of the node"
        unit: "%"
        job: "node-exporter"

    - expr: irate(node_disk_reads_completed_total{job="node-exporter"}[1m])
      record: node_exporter:disk:read:count:rate
      labels:
        desc: "disk read rate of the node"
        unit: "ops/s"
        job: "node-exporter"

    - expr: irate(node_disk_writes_completed_total{job="node-exporter"}[1m])
      record: node_exporter:disk:write:count:rate
      labels:
        desc: "disk write rate of the node"
        unit: "ops/s"
        job: "node-exporter"

    - expr: (irate(node_disk_read_bytes_total{job="node-exporter"}[1m]))/1024/1024
      record: node_exporter:disk:read:mb:rate
      labels:
        desc: "device read rate of the node in MB"
        unit: "MB/s"
        job: "node-exporter"

    - expr: (irate(node_disk_written_bytes_total{job="node-exporter"}[1m]))/1024/1024
      record: node_exporter:disk:write:mb:rate
      labels:
        desc: "device write rate of the node in MB"
        unit: "MB/s"
        job: "node-exporter"

##############################################################################################
#                                filesystem                                                  #
    - expr:   (1 -node_filesystem_files_free{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_files{job="node-exporter",fstype=~"ext4|xfs"}) * 100
      record: node_exporter:filesystem:used:percent
      labels:
        desc: "inode usage percentage of the node"
        unit: "%"
        job: "node-exporter"
#############################################################################################
#                                filefd                                                     #
    - expr: node_filefd_allocated{job="node-exporter"}
      record: node_exporter:filefd_allocated:count
      labels:
        desc: "number of open file descriptors on the node"
        unit: " "
        job: "node-exporter"

    - expr: node_filefd_allocated{job="node-exporter"}/node_filefd_maximum{job="node-exporter"} * 100
      record: node_exporter:filefd_allocated:percent
      labels:
        desc: "percentage of file descriptors open on the node"
        unit: "%"
        job: "node-exporter"

#############################################################################################
#                                network                                                    #
    - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:bit:rate
      labels:
        desc: "bytes received per second on the node's NICs"
        unit: "byte/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:bit:rate
      labels:
        desc: "bytes sent per second on the node's NICs"
        unit: "byte/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:packet:rate
      labels:
        desc: "packets received per second on the node's NICs"
        unit: "packets/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:packet:rate
      labels:
        desc: "packets sent per second on the node's NICs"
        unit: "packets/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:error:rate
      labels:
        desc: "receive errors per second detected by the device driver"
        unit: "errors/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:error:rate
      labels:
        desc: "transmit errors per second detected by the device driver"
        unit: "errors/s"
        job: "node-exporter"

    - expr: node_tcp_connection_states{job="node-exporter", state="established"}
      record: node_exporter:network:tcp:established:count
      labels:
        desc: "current number of established TCP connections"
        unit: "connections"
        job: "node-exporter"

    - expr: node_tcp_connection_states{job="node-exporter", state="time_wait"}
      record: node_exporter:network:tcp:timewait:count
      labels:
        desc: "number of TCP connections in time_wait"
        unit: "connections"
        job: "node-exporter"

    - expr: sum by (environment,instance) (node_tcp_connection_states{job="node-exporter"})
      record: node_exporter:network:tcp:total:count
      labels:
        desc: "total number of TCP connections on the node"
        unit: "connections"
        job: "node-exporter"

#############################################################################################
#                                process                                                    #
    - expr: node_processes_state{state="Z"}
      record: node_exporter:process:zoom:total:count
      labels:
        desc: "number of zombie (Z state) processes on the node"
        unit: "processes"
        job: "node-exporter"
#############################################################################################
#                                other                                                      #
    - expr: abs(node_timex_offset_seconds{job="node-exporter"})
      record: node_exporter:time:offset
      labels:
        desc: "clock offset of the node"
        unit: "s"
        job: "node-exporter"

#############################################################################################

    - expr: count by (instance) ( count by (instance,cpu) (node_cpu_seconds_total{ mode='system'}) )
      record: node_exporter:cpu:count
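As a sanity check on the percentage rules above, here is the (1 - avail / size) * 100 formula from node_exporter:disk:used:percent evaluated on made-up numbers:

```shell
# Hypothetical filesystem: 50 GiB total with 6 GiB available
size=$((50 * 1024 * 1024 * 1024))
avail=$((6 * 1024 * 1024 * 1024))
awk -v s="$size" -v a="$avail" 'BEGIN { printf "used: %.0f%%\n", (1 - a / s) * 100 }'
```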

           

The alerting rules file:

#node-exporter-alert.rules
groups:
  - name: node-exporter-alert
    rules:
    - alert: node-exporter-down
      expr: node_exporter:up == 0
      for: 1m
      labels:
        severity: 'critical'
      annotations:
        summary: "instance: {{ $labels.instance }} is down"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-cpu-high
      expr:  node_exporter:cpu:total:percent > 80
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu usage is high: {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has stayed above 80% for 3 minutes."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-cpu-iowait-high
      expr:  node_exporter:cpu:iowait:percent >= 12
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu iowait is high: {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} cpu iowait has stayed above 12% for 3 minutes."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-load-load1-high
      expr:  (node_exporter:load:load1) > (node_exporter:cpu:count) * 1.2
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} load1 is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-memory-high
      expr:  node_exporter:memory:used:percent > 85
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} memory usage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-disk-high
      expr:  node_exporter:disk:used:percent > 88
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk usage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-disk-read:count-high
      expr:  node_exporter:disk:read:count:rate > 3000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} read iops is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-disk-write-count-high
      expr:  node_exporter:disk:write:count:rate > 3000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} write iops is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-disk-read-mb-high
      expr:  node_exporter:disk:read:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk read throughput is high: {{ $value }}"
        description: ""
        instance: "{{ $labels.instance }}"
        value: "{{ $value }}"

    - alert: node-exporter-disk-write-mb-high
      expr:  node_exporter:disk:write:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk write throughput is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-filefd-allocated-percent-high
      expr:  node_exporter:filefd_allocated:percent > 80
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} open file descriptor percentage is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-network-netin-error-rate-high
      expr:  node_exporter:network:netin:error:rate > 4
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} inbound packet error rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-network-netin-packet-rate-high
      expr:  node_exporter:network:netin:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} inbound packet rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-network-netout-packet-rate-high
      expr:  node_exporter:network:netout:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} outbound packet rate is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-network-tcp-total-count-high
      expr:  node_exporter:network:tcp:total:count > 40000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} TCP connection count is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-process-zoom-total-count-high
      expr:  node_exporter:process:zoom:total:count > 10
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} zombie process count is high: {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-time-offset-high
      expr:  node_exporter:time:offset > 0.03
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} {{ $labels.desc }}  {{ $value }} {{ $labels.unit }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"
           

Configure WeChat Work alerts

Create a directory to hold wechat.tmpl:

mkdir /alertmanager/template/
cd /alertmanager/template/
vim wechat.tmpl
## 帶恢複告警的模版 注:alertmanager.yml wechat_configs中加上配置send_resolved: true
{{ define "wechat.default.message" }}
{{ range $i, $alert :=.Alerts }}
===alertmanager監控報警===
告警狀态:{{   .Status }}
告警級别:{{ $alert.Labels.severity }}
告警類型:{{ $alert.Labels.alertname }}
告警應用:{{ $alert.Annotations.summary }}
故障主機: {{ $alert.Labels.instance }}
告警主題: {{ $alert.Annotations.summary }}
觸發閥值:{{ $alert.Annotations.value }}
告警詳情: {{ $alert.Annotations.description }}
觸發時間: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===========end============
{{ end }}
{{ end }}

           

Update the recipients in alertmanager.yml:

global:
  resolve_timeout: 5m
  wechat_api_corp_id: 'ww2b0ab679438a91fc'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'Zl6pH_f-u2R1bwqDVPfLFygTR-JaYpH08vcTBr8xb0A'
templates:
  - '/etc/alertmanager/template/wechat.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    to_party: '2'
    to_user: 'LiZhiSheng'
    agent_id: 1000005
    corp_id: 'ww2b0ab679438a91fc'
    api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
    api_secret: 'Zl6pH_f-u2R1bwqDVPfLFygTR-JaYpH08vcTBr8xb0A'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
           

The prometheus configuration does not change; we keep using the existing rules.

Restart alertmanager:

docker rm -f alertmanager
docker run -d --name alertmanager -p 9093:9093 -v /alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /alertmanager/template/:/etc/alertmanager/template prom/alertmanager

Parameter notes:

  • corp_id: the unique ID of the WeChat Work account, shown under "My Company".
  • to_party: the department (group) to send to.
  • to_user: the user to send to.
  • agent_id: the ID of the enterprise app, shown on the app's detail page.
  • api_secret: the secret of the enterprise app, shown on the app's detail page.

Verification:

Stop one of the docker containers and the alert message arrives; once the container is back up, a resolved message follows.