天天看點

【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

文章目錄

  • 1、Prometheus 監控告警系統
      • (1)Alertmanager
      • (2)Alertmanager 環境搭建
      • (3)Prometheus 配置 Alertmanager
      • (4)測試郵件發送以及接收
  • 微信公衆号

1、Prometheus 監控告警系統

(1)Alertmanager

Alertmanager處理用戶端應用程式(如Prometheus伺服器)發送的警報。它負責重複資料删除、分組和路由它們到正确的接收端內建,如電子郵件、PagerDuty或OpsGenie。它還負責警報的沉默和抑制

分組

分組将類似性質的警報分類為單個通知。這在較大的中斷中特别有用,當大量系統一次出現故障并且可能同時觸發數百到數千個警報時。

示例:發生網絡分區時,群集中正在運作數十個或數百個服務執行個體。您有一半的服務執行個體不再可以通路資料庫。Prometheus中的警報規則配置為在每個服務執行個體無法與資料庫通信時為其發送警報。結果,數百個警報被發送到Alertmanager。

作為使用者,人們隻希望獲得一個頁面,同時仍然能夠準确檢視受影響的服務執行個體。是以,可以将Alertmanager配置為按警報的群集和警報名稱分組警報,以便它發送一個緊湊的通知。

警報的分組,分組通知的時間以及這些通知的接收者由配置檔案中的路由樹配置。

抑制

禁止是一種概念,如果某些其他警報已經觸發,則抑制某些警報的通知。

示例:正在觸發警報,通知您整個群集無法通路。可以将Alertmanager配置為使與該群集有關的所有其他警報靜音。這樣可以防止與實際問題無關的數百或數千個觸發警報的通知。

通過Alertmanager的配置檔案配置禁止。

沉默

靜默是一種簡單的方法,可以在給定時間内簡單地使警報靜音。沉默是根據比對器配置的,就像路由樹一樣。檢查傳入警報是否與活動靜默的所有相等或正規表達式比對項比對。如果這樣做,則不會針對該警報發送任何通知。

在Alertmanager的Web界面中配置沉默。

  • Github alertmanager

(2)Alertmanager 環境搭建

拉取 Prometheus 提供的監控告警鏡像

docker pull prom/alertmanager
           

在挂載的目錄下添加 alertmanager 配置檔案 alertmanager.yml

# 全局參數
global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: '********'
  smtp_require_tls: false   #要設定後才能發送成功,預設是true

route:
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'mail-receiver'
# 接收者
receivers:
  - name: 'mail-receiver'
    email_configs:
    - to: '[email protected]'
# 告警抑制
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
           

建構 alertmanager 容器

docker run -d -p 9093:9093 \
-v /tellsea/docker/alertmanager/:/etc/alertmanager/ \
--name=alertmanager prom/alertmanager
           

啟動 alertmanager 容器

docker start alertmanager
           

浏覽器通路

http://101.200.127.67:9093/#/alerts
           

通路效果如下

【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

(3)Prometheus 配置 Alertmanager

在 Prometheus 挂載的目錄下添加報警規則的配置檔案

  • 建立 node-exporter-record-rule.yml
groups:
  - name: node-exporter-record
    rules:
    - expr: up{job=~"node-exporter"}
      record: node_exporter:up
      labels:
        desc: "節點是否線上, 線上1,不線上0"
        unit: " "
        job: "node-exporter"
    - expr: time() - node_boot_time_seconds{}
      record: node_exporter:node_uptime
      labels:
        desc: "節點的運作時間"
        unit: "s"
        job: "node-exporter"
##############################################################################################
#                              cpu                                                           #
    - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])))  * 100
      record: node_exporter:cpu:total:percent
      labels:
        desc: "節點的cpu總消耗百分比"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])))  * 100
      record: node_exporter:cpu:idle:percent
      labels:
        desc: "節點的cpu idle百分比"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="iowait"}[5m])))  * 100
      record: node_exporter:cpu:iowait:percent
      labels:
        desc: "節點的cpu iowait百分比"
        unit: "%"
        job: "node-exporter"


    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="system"}[5m])))  * 100
      record: node_exporter:cpu:system:percent
      labels:
        desc: "節點的cpu system百分比"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="user"}[5m])))  * 100
      record: node_exporter:cpu:user:percent
      labels:
        desc: "節點的cpu user百分比"
        unit: "%"
        job: "node-exporter"

    - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode=~"softirq|nice|irq|steal"}[5m])))  * 100
      record: node_exporter:cpu:other:percent
      labels:
        desc: "節點的cpu 其他的百分比"
        unit: "%"
        job: "node-exporter"
##############################################################################################


##############################################################################################
#                                    memory                                                  #
    - expr: node_memory_MemTotal_bytes{job="node-exporter"}
      record: node_exporter:memory:total
      labels:
        desc: "節點的記憶體總量"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemFree_bytes{job="node-exporter"}
      record: node_exporter:memory:free
      labels:
        desc: "節點的剩餘記憶體量"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"}
      record: node_exporter:memory:used
      labels:
        desc: "節點的已使用記憶體量"
        unit: byte
        job: "node-exporter"

    - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}
      record: node_exporter:memory:actualused
      labels:
        desc: "節點使用者實際使用的記憶體量"
        unit: byte
        job: "node-exporter"

    - expr: (1-(node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
      record: node_exporter:memory:used:percent
      labels:
        desc: "節點的記憶體使用百分比"
        unit: "%"
        job: "node-exporter"

    - expr: ((node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
      record: node_exporter:memory:free:percent
      labels:
        desc: "節點的記憶體剩餘百分比"
        unit: "%"
        job: "node-exporter"
##############################################################################################
#                                   load                                                     #
    - expr: sum by (instance) (node_load1{job="node-exporter"})
      record: node_exporter:load:load1
      labels:
        desc: "系統1分鐘負載"
        unit: " "
        job: "node-exporter"

    - expr: sum by (instance) (node_load5{job="node-exporter"})
      record: node_exporter:load:load5
      labels:
        desc: "系統5分鐘負載"
        unit: " "
        job: "node-exporter"

    - expr: sum by (instance) (node_load15{job="node-exporter"})
      record: node_exporter:load:load15
      labels:
        desc: "系統15分鐘負載"
        unit: " "
        job: "node-exporter"

##############################################################################################
#                                 disk                                                       #
    - expr: node_filesystem_size_bytes{job="node-exporter" ,fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:total
      labels:
        desc: "節點的磁盤總量"
        unit: byte
        job: "node-exporter"

    - expr: node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:free
      labels:
        desc: "節點的磁盤剩餘空間"
        unit: byte
        job: "node-exporter"

    - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
      record: node_exporter:disk:usage:used
      labels:
        desc: "節點的磁盤使用的空間"
        unit: byte
        job: "node-exporter"

    - expr:  (1 - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}) * 100
      record: node_exporter:disk:used:percent
      labels:
        desc: "節點的磁盤的使用百分比"
        unit: "%"
        job: "node-exporter"

    - expr: irate(node_disk_reads_completed_total{job="node-exporter"}[1m])
      record: node_exporter:disk:read:count:rate
      labels:
        desc: "節點的磁盤讀取速率"
        unit: "次/秒"
        job: "node-exporter"

    - expr: irate(node_disk_writes_completed_total{job="node-exporter"}[1m])
      record: node_exporter:disk:write:count:rate
      labels:
        desc: "節點的磁盤寫入速率"
        unit: "次/秒"
        job: "node-exporter"

    - expr: (irate(node_disk_written_bytes_total{job="node-exporter"}[1m]))/1024/1024
      record: node_exporter:disk:read:mb:rate
      labels:
        desc: "節點的裝置讀取MB速率"
        unit: "MB/s"
        job: "node-exporter"

    - expr: (irate(node_disk_read_bytes_total{job="node-exporter"}[1m]))/1024/1024
      record: node_exporter:disk:write:mb:rate
      labels:
        desc: "節點的裝置寫入MB速率"
        unit: "MB/s"
        job: "node-exporter"

##############################################################################################
#                                filesystem                                                  #
    - expr:   (1 -node_filesystem_files_free{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_files{job="node-exporter",fstype=~"ext4|xfs"}) * 100
      record: node_exporter:filesystem:used:percent
      labels:
        desc: "節點的inode的剩餘可用的百分比"
        unit: "%"
        job: "node-exporter"
#############################################################################################
#                                filefd                                                     #
    - expr: node_filefd_allocated{job="node-exporter"}
      record: node_exporter:filefd_allocated:count
      labels:
        desc: "節點的檔案描述符打開個數"
        unit: "%"
        job: "node-exporter"

    - expr: node_filefd_allocated{job="node-exporter"}/node_filefd_maximum{job="node-exporter"} * 100
      record: node_exporter:filefd_allocated:percent
      labels:
        desc: "節點的檔案描述符打開百分比"
        unit: "%"
        job: "node-exporter"

#############################################################################################
#                                network                                                    #
    - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:bit:rate
      labels:
        desc: "節點網卡eth0每秒接收的比特數"
        unit: "bit/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:bit:rate
      labels:
        desc: "節點網卡eth0每秒發送的比特數"
        unit: "bit/s"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:packet:rate
      labels:
        desc: "節點網卡每秒接收的資料包個數"
        unit: "個/秒"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:packet:rate
      labels:
        desc: "節點網卡發送的資料包個數"
        unit: "個/秒"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netin:error:rate
      labels:
        desc: "節點裝置驅動器檢測到的接收錯誤包的數量"
        unit: "個/秒"
        job: "node-exporter"

    - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
      record: node_exporter:network:netout:error:rate
      labels:
        desc: "節點裝置驅動器檢測到的發送錯誤包的數量"
        unit: "個/秒"
        job: "node-exporter"

    - expr: node_tcp_connection_states{job="node-exporter", state="established"}
      record: node_exporter:network:tcp:established:count
      labels:
        desc: "節點目前established的個數"
        unit: "個"
        job: "node-exporter"

    - expr: node_tcp_connection_states{job="node-exporter", state="time_wait"}
      record: node_exporter:network:tcp:timewait:count
      labels:
        desc: "節點timewait的連接配接數"
        unit: "個"
        job: "node-exporter"

    - expr: sum by (environment,instance) (node_tcp_connection_states{job="node-exporter"})
      record: node_exporter:network:tcp:total:count
      labels:
        desc: "節點tcp連接配接總數"
        unit: "個"
        job: "node-exporter"

#############################################################################################
#                                process                                                    #
    - expr: node_processes_state{state="Z"}
      record: node_exporter:process:zoom:total:count
      labels:
        desc: "節點目前狀态為zoom的個數"
        unit: "個"
        job: "node-exporter"
#############################################################################################
#                                other                                                    #
    - expr: abs(node_timex_offset_seconds{job="node-exporter"})
      record: node_exporter:time:offset
      labels:
        desc: "節點的時間偏差"
        unit: "s"
        job: "node-exporter"

#############################################################################################

    - expr: count by (instance) ( count by (instance,cpu) (node_cpu_seconds_total{ mode='system'}) )
      record: node_exporter:cpu:count

           
  • 建立 node-exporter-alert-rule.yml
groups:
  - name: node-exporter-alert
    rules:
    - alert: node-exporter-down
      expr: node_exporter:up == 0
      for: 1m
      labels:
        severity: 'critical'
      annotations:
        summary: "instance: {{ $labels.instance }} 當機了"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} 關機了, 時間已經1分鐘了。"
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"



    - alert: node-exporter-cpu-high
      expr:  node_exporter:cpu:total:percent > 80
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu 使用率高于 {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU使用率已經持續三分鐘高過80% 。"
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-cpu-iowait-high
      expr:  node_exporter:cpu:iowait:percent >= 12
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} cpu iowait 使用率高于 {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} cpu iowait使用率已經持續三分鐘高過12%"
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-load-load1-high
      expr:  (node_exporter:load:load1) > (node_exporter:cpu:count) * 1.2
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} load1 使用率高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-memory-high
      expr:  node_exporter:memory:used:percent > 85
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} memory 使用率高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-disk-high
      expr:  node_exporter:disk:used:percent > 88
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} disk 使用率高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-disk-read:count-high
      expr:  node_exporter:disk:read:count:rate > 3000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} iops read 使用率高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-disk-write-count-high
      expr:  node_exporter:disk:write:count:rate > 3000
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} iops write 使用率高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"





    - alert: node-exporter-disk-read-mb-high
      expr:  node_exporter:disk:read:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 讀取位元組數 高于 {{ $value }}"
        description: ""
        instance: "{{ $labels.instance }}"
        value: "{{ $value }}"


    - alert: node-exporter-disk-write-mb-high
      expr:  node_exporter:disk:write:mb:rate > 60
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 寫入位元組數 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-filefd-allocated-percent-high
      expr:  node_exporter:filefd_allocated:percent > 80
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 打開檔案描述符 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-network-netin-error-rate-high
      expr:  node_exporter:network:netin:error:rate > 4
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 包進入的錯誤速率 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-exporter-network-netin-packet-rate-high
      expr:  node_exporter:network:netin:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 包進入速率 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-network-netout-packet-rate-high
      expr:  node_exporter:network:netout:packet:rate > 35000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 包流出速率 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-network-tcp-total-count-high
      expr:  node_exporter:network:tcp:total:count > 40000
      for: 1m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} tcp連接配接數量 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-process-zoom-total-count-high
      expr:  node_exporter:process:zoom:total:count > 10
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} 僵死程序數量 高于 {{ $value }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"


    - alert: node-exporter-time-offset-high
      expr:  node_exporter:time:offset > 0.03
      for: 2m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} {{ $labels.desc }}  {{ $value }} {{ $labels.unit }}"
        description: ""
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

           

修改 Prometheus 配置檔案,增加報警資訊

rule_files:
  # 通配規則檔案
  - "*rule.yml"
# Alertmanager 配置
alerting:
  alertmanagers:
  - scheme: http
  - static_configs:
    - targets: ["101.200.127.67:9093"]
           

重新開機一下Prometheus

docker restart prometheus
           

檢視報警資訊

http://101.200.127.67:9090/alerts
           
【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

檢視規則檔案配置

http://101.200.127.67:9090/rules
           
【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

(4)測試郵件發送以及接收

觸發郵件通知隻需要滿足規則制定的條件即可,這裡我直接示範關閉一個元件

docker stop node-exporter
           

然後過一會兒通路 alertmananger 發現多了點資訊

http://101.200.127.67:9093/#/alerts
           
【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

Prometheus 中也能看到相應的錯誤資訊

http://101.200.127.67:9090/alerts
           
【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

然後過了一會兒(時間久一點),收到QQ郵件

【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号

微信公衆号

【Docker學習】14、使用 Docker 搭建 Prometheus 監控告警系統(郵件通知)1、Prometheus 監控告警系統微信公衆号