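Monitoring Hosts and Alerting with prometheus + grafana + node_exporter + alertmanager

Installing prometheus
Installing node_exporter
Installing grafana
Installing alertmanager
References

Installing prometheus

Install prometheus

All Prometheus releases: https://prometheus.io/download/

Taking Linux as an example, download the precompiled binary package and extract it: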

$ wget  https://github.com/prometheus/prometheus/releases/download/v2.11.1/prometheus-2.11.1.linux-amd64.tar.gz
$ tar xzvf prometheus-2.11.1.linux-amd64.tar.gz
$ mv prometheus-2.11.1.linux-amd64 /usr/local/prometheus

Verify the installation

$ cd /usr/local/prometheus
$ ./prometheus --version
prometheus, version 2.11.1 (branch: HEAD, revision: e5b22494857deca4b806f74f6e3a6ee30c251763)
  build user:       [email protected]
  build date:       20190710-13:51:17
  go version:       go1.12.7

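Edit the prometheus configuration file

The default configuration file is prometheus.yml in the prometheus directory. Its contents are as follows:

$ cat /usr/local/prometheus/prometheus.yml

# Global Prometheus settings
global:
  scrape_interval:     15s # how often to scrape targets; the default is 1m
  evaluation_interval: 15s # how often to evaluate rules; the default is 1m
  scrape_timeout: 15s # timeout for a single scrape; the default is 10s, and it must not exceed scrape_interval
  external_labels: # labels attached to series and alerts when talking to external systems (federation, remote storage, Alertmanager)
   monitor: 'codelab_monitor'


# Alertmanager settings
alerting:
 alertmanagers:
 - static_configs:
   - targets: ["localhost:9093"] # address and port that alertmanager listens on; prometheus sends alerts here

# Rule files: loaded at startup, then evaluated every evaluation_interval
rule_files:
 - "alertmanager_rules.yml"
 - "prometheus_rules.yml"

# Scrape settings
scrape_configs:
- job_name: 'prometheus' # job_name is written to the job label of every scraped series and can be used in queries
  scrape_interval: 15s # scrape interval for this job; falls back to the global value when omitted
  static_configs: # statically configured targets
  - targets: ['localhost:9090'] # addresses that prometheus scrapes; each one becomes an instance

Create a dedicated user to run prometheus. Its home directory, /var/lib/prometheus, is used to store the prometheus data.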

$ groupadd prometheus
$ useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus

Create a systemd service

$ vim /lib/systemd/system/prometheus.service

[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/data \
--web.enable-admin-api \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target

$ mkdir /var/lib/prometheus/data
$ chown -R prometheus:prometheus /var/lib/prometheus/data

Start prometheus

$ systemctl daemon-reload
$ systemctl start prometheus

Verify that it started

The default listening port is 9090.

$ systemctl status prometheus
$ netstat -lnpt|grep 9090

Access the built-in web UI

prometheus ships with a web UI where you can view expression query results, alert configuration, the prometheus configuration, exporter information, and more. By default the UI is at http://ip:9090.

You can also visit http://ip:9090/metrics to see the metrics that prometheus exposes about itself.

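The above is the simplest way to start Prometheus. Prometheus also accepts a number of startup options.

Prometheus startup options

--config.file  Configuration file to load. Example: --config.file="prometheus.yml"

--web.listen-address  IP address and port to listen on. Example: --web.listen-address="0.0.0.0:9090"

--web.enable-admin-api  Enable the API endpoints for admin control actions.

--web.enable-lifecycle  Enable shutdown and reload via HTTP request.

--storage.tsdb.path  Directory where prometheus stores its data. Example: --storage.tsdb.path="/data/"

--storage.tsdb.retention.time  How long prometheus retains data; the default is 15 days. Example: --storage.tsdb.retention.time="24h"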
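
As a rough illustration of how these options combine, the sketch below assembles a command line and only prints it (a dry run), so nothing is started. The helper name build_prometheus_cmd is made up for this example, and the install path /usr/local/prometheus is the one assumed throughout this guide.

```shell
#!/usr/bin/env bash
# Dry run: assemble a prometheus command line from the options above and
# print it instead of executing it.

# build_prometheus_cmd LISTEN_ADDR DATA_DIR RETENTION  (hypothetical helper)
build_prometheus_cmd() {
    local home="/usr/local/prometheus"   # install path assumed by this guide
    local listen="$1" data_dir="$2" retention="$3"
    printf '%s' "${home}/prometheus"
    # printf repeats the format once per argument, giving " flag flag ..."
    printf ' %s' \
        "--config.file=${home}/prometheus.yml" \
        "--web.listen-address=${listen}" \
        "--storage.tsdb.path=${data_dir}" \
        "--storage.tsdb.retention.time=${retention}"
    printf '\n'
}

build_prometheus_cmd "0.0.0.0:9090" "/var/lib/prometheus/data" "15d"
```

The printed line can be pasted into the ExecStart entry of the systemd unit once it looks right.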

Delete prometheus data

Once the admin API is enabled, the following syntax deletes all time series that match a given label:

$ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={kubernetes_name="prometheus"}'

To delete the series of a particular job or instance, use the following commands:

$ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="prometheus"}'
$ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="prometheus"}'

To delete all data from Prometheus, use the following command:

$ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'

To delete the data in a given time range (here the Unix timestamps 1557903714 to 1557903954), use the following command:

$ curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?start=1557903714&end=1557903954&match[]={instance="prometheus",job="prometheus"}'

Note that these API calls do not delete the data immediately. The series are only marked as deleted and still exist on disk; they are cleaned up during a later compaction.
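
The time-bounded delete above can be wrapped in a small helper, sketched here under a few assumptions: PROM_URL and the function name build_delete_url are invented for this example, and the script only prints the curl commands rather than running them. The second printed command uses the admin API's clean_tombstones endpoint, which asks Prometheus to drop already-deleted data from disk without waiting for compaction.

```shell
#!/usr/bin/env bash
# Print (not run) the curl commands for a time-bounded delete.
# PROM_URL is an assumed address; adjust it to your server.
PROM_URL="http://localhost:9090"

# build_delete_url START END MATCHER -> delete_series URL for that window
# (hypothetical helper; START and END are Unix timestamps in seconds)
build_delete_url() {
    local start="$1" end="$2" matcher="$3"
    printf '%s/api/v1/admin/tsdb/delete_series?start=%s&end=%s&match[]=%s\n' \
        "$PROM_URL" "$start" "$end" "$matcher"
}

url="$(build_delete_url 1557903714 1557903954 '{job="prometheus"}')"
echo "curl -X POST -g '${url}'"

# Deleted series are normally dropped at a later compaction; this endpoint
# requests that they be removed from disk right away.
echo "curl -X POST '${PROM_URL}/api/v1/admin/tsdb/clean_tombstones'"
```

Printing first makes it easy to double-check the matcher before anything is actually deleted.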

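Installing node_exporter

prometheus collects host information through the HTTP endpoint exposed by node_exporter.

Install node_exporter

node_exporter documentation on GitHub: https://github.com/prometheus/node_exporter

All node_exporter releases: https://github.com/prometheus/node_exporter/releases

Download the precompiled binary package and extract it: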

$ wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
$ tar -xvf node_exporter-0.18.1.linux-amd64.tar.gz
$ mv node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

Verify the installation

$ cd /usr/local/node_exporter
$ ./node_exporter --version
node_exporter, version 0.18.1 (branch: HEAD, revision: 3db77732e925c08f675d7404a8c46466b2ece83e)
  build user:       [email protected]
  build date:       20190604-16:41:18
  go version:       go1.12.5

Create a systemd service

$ vim /lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

Start node_exporter

$ systemctl daemon-reload
$ systemctl start node_exporter

Verify that it started

The default listening port is 9100.

$ systemctl status node_exporter
$ netstat -lnpt|grep 9100

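Add node_exporter to prometheus.yml

$ vim /usr/local/prometheus/prometheus.yml

  - job_name: node_exporter  # any name you like
    static_configs:
    - targets: ['127.0.0.1:9100'] # node_exporter listens on port 9100
      labels:
        instance: node_exporter # custom label value
        group: node_exporter # custom label value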

Reload the prometheus configuration

$ systemctl reload prometheus
or
$ curl -X POST http://localhost:9090/-/reload  (requires the --web.enable-lifecycle option)

Check that the configuration took effect

Visit http://127.0.0.1:9090.

Click Targets to see the node that was added.

Visit http://127.0.0.1:9100/metrics to see the metrics exposed by the node.
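
The /metrics page is plain text, one sample per line, so it can be inspected with ordinary shell tools. The sketch below defines sum_metric, a helper made up for this example, which adds up all samples of one metric name. The sample text is hypothetical and only imitates the format; against a live exporter the input would come from curl instead.

```shell
#!/usr/bin/env bash
# sum_metric NAME: read Prometheus text-format metrics on stdin and print the
# sum of all samples whose metric name is exactly NAME. (hypothetical helper)
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)       # strip the label set, if any
            if (metric == name) total += $NF
        }
        END { print total + 0 }
    '
}

# Hypothetical sample in the format served at /metrics.
sample='# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 100
node_cpu_seconds_total{cpu="1",mode="idle"} 50
node_load1 0.25'

printf '%s\n' "$sample" | sum_metric node_cpu_seconds_total   # prints 150
```

With node_exporter running, the same helper could be fed from the live endpoint, for example: curl -s http://127.0.0.1:9100/metrics | sum_metric node_cpu_seconds_total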

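Installing grafana

Grafana is an open-source application for visualizing large amounts of measurement data. It provides a powerful and elegant way to create, share, and explore data, and its dashboards display data from different metric data sources.

grafana website: https://grafana.com

grafana releases: https://grafana.com/grafana/download

Install grafana

Taking Ubuntu as an example: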

$ wget https://dl.grafana.com/oss/release/grafana_6.3.1_amd64.deb 
$ dpkg -i grafana_6.3.1_amd64.deb

Verify the installation

$ grafana-server -v
Version 6.3.1 (commit: f2fffad, branch: HEAD)

Start grafana

$ service grafana-server start

Verify that it started

The default listening port is 3000.

$ service grafana-server status
$ netstat -lnpt|grep 3000

Access grafana

Visit http://127.0.0.1:3000. The default username and password are both admin.

Add a data source

Click Data Sources.

Select Prometheus as the data source type.

Set the name to Prometheus and the URL to http://localhost:9090, leave the other settings at their defaults, and save.

On the Dashboards page, import the Prometheus Status template. Here a template from the official website is imported.

Click import. You can either enter a template ID or upload a JSON file.

Official templates: https://grafana.com/grafana/dashboards

This example uses template 405. In the Prometheus field select the Prometheus data source, then click import.

Set the time range to the last 5 minutes and data should appear.

You can import other templates as needed, or build your own dashboards.

Some templates depend on plugins, which can be downloaded from the official website; installation instructions are in the official documentation.

Official plugin downloads: https://grafana.com/grafana/plugins

Installing alertmanager

Install alertmanager

Alerts can be sent with grafana's built-in alerting or through alertmanager.

All alertmanager releases: https://github.com/prometheus/alertmanager/releases

Download the precompiled binary and extract it:

$ wget https://github.com/prometheus/alertmanager/releases/download/v0.18.0/alertmanager-0.18.0.linux-amd64.tar.gz
$ tar -xvf alertmanager-0.18.0.linux-amd64.tar.gz
$ mv alertmanager-0.18.0.linux-amd64 /usr/local/alertmanager

Verify the installation

$ cd /usr/local/alertmanager
$ ./alertmanager --version
alertmanager, version 0.18.0 (branch: HEAD, revision: 1ace0f76b7101cccc149d7298022df36039858ca)
  build user:       [email protected]
  build date:       20190708-14:31:49
  go version:       go1.12.6

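Edit the main configuration file

The main configuration file is alertmanager.yml.

$ vim alertmanager.yml

# Global settings
global:
  resolve_timeout: 5m # time after which an alert is declared resolved if it has not been updated; the default is 5m
  smtp_smarthost: 'smtp.qq.com:587' # SMTP server (host:port) used to send mail
  smtp_from: '******@qq.com' # sender address
  smtp_auth_username: '******@qq.com' # SMTP account name
  smtp_auth_password: '******' # authorization code
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat Work API URL


# Notification templates
templates:
  - 'template/*.tmpl'

# Routing tree
route:
  group_by: ['alertname'] # labels by which alerts are grouped
  group_wait: 20s # how long to wait before sending the first notification for a new group of alerts
  group_interval: 20s # how long to wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 5m # how long to wait before re-sending a notification; do not set this too low for email, or the SMTP server may reject the mail for being sent too often
  receiver: 'email' # name of the receiver for alerts; must match a name under receivers below

# Receivers
receivers:
  - name: 'email' # receiver name
    email_configs: # email settings
    - to: '******@163.com,******@qq.com'  # recipient addresses; separate multiple addresses with ","
      html: '{{ template "test.html" . }}' # template for the email body
      headers: { Subject: "[WARN] Alert email"} # email subject
    webhook_configs: # webhook settings; comment out if not needed
    - url: 'http://127.0.0.1:5001'
      send_resolved: true
    wechat_configs: # WeChat Work settings; comment out if not needed
    - send_resolved: true
      to_party: '1' # ID of the receiving department
      agent_id: '1000002' # (WeChat Work --> custom app --> AgentId)
      corp_id: '******' # company ID (My Company --> CorpId, at the bottom of the page)
      api_secret: '******' # app secret (WeChat Work --> custom app --> Secret)
      message: '{{ template "test_wechat.html" . }}' # template for the message

The configuration above sets up three notification channels: email, webhook, and WeChat Work.

Notes:

1) Do not set repeat_interval too low for email, or the SMTP server may reject the mail for being sent too often.

2) WeChat Work registration: https://work.weixin.qq.com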

.tmpl template files

Email alerts

$ mkdir template
$ vim template/test.tmpl

{{ define "test.html" }}
<table >
        <tr>
                <td>Group</td>
                <td>Alert</td>
                <td>Instance</td>
                <td>Alert value</td>
                <td>Start time</td>
                <td>Details</td>
        </tr>
        {{ range $i, $alert := .Alerts }}
                <tr>
                        <td>{{ index $alert.Labels "group" }}</td>
                        <td>{{ index $alert.Labels "alertname" }}</td>
                        <td>{{ index $alert.Labels "instance" }}</td>
                        <td>{{ index $alert.Annotations "value" }}</td>
                        <td>{{ $alert.StartsAt }}</td>
                        <td>{{ index $alert.Annotations "summary" }}</td>
                </tr>
        {{ end }}
</table>
{{ end }}

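The Labels entries above refer to the labels carried by the alert from prometheus. The Annotations entries refer to the annotations defined in the alerting rule.

WeChat Work alerts

$ vim template/test_wechat.tmpl

{{ define "test_wechat.html" }}
  {{ range $i, $alert := .Alerts.Firing }}
    [Alert]:{{ index $alert.Labels "alertname" }}
    [Instance]:{{ index $alert.Labels "instance" }}
    [Alert value]:{{ index $alert.Annotations "value" }}
    [Start time]:{{ $alert.StartsAt }}
  {{ end }}
{{ end }}

Unlike the email template, the range here iterates only over alerts that are currently firing (.Alerts.Firing). Without this restriction, alerts that have already been resolved would also be sent to WeChat Work.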

Define the alerting rules

$ cd /usr/local/prometheus
$ vim rules.yml

groups:
  - name: node_status
    rules:
    - alert: node_status # alert name
      expr: probe_success == 0 # alert condition, written in PromQL; probe_success is exported by blackbox_exporter, so with node_exporter alone use up == 0
      for: 1m # how long the condition must hold before the alert fires
      labels: # extra labels attached to the alert
        status: critical
      annotations: # annotations that describe the alert in detail
        summary: "group:{{$labels.group}},instance:{{$labels.instance}} has been down "
        description: "group:{{$labels.group}},instance:{{$labels.instance}} has been down "
        value: "{{$value}}"
  - name: CPU
    rules:
    - alert: CPU_usage
      expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[6m]))) by (group, instance) * 100 > 80
      for: 1m
      labels:
        status: warning
      annotations:
        summary: "group:{{$labels.group}},instance:{{$labels.instance}}: CPU usage is above 80%"
        value: "{{$value}}"

Add or modify alerting rules to suit your own needs.

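The three states in an alert's life cycle

inactive: the alert is neither firing nor pending.
pending: the alert condition is met, but it has not yet held for the duration set by for.
firing: the alert condition has held for at least the duration set by for.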
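
The three states can be read as a simple decision on two values: whether the rule's condition currently holds, and for how long it has held compared with the for duration. The sketch below is only a model of that decision, not how prometheus is implemented; the function name alert_state and its arguments are invented for this example.

```shell
#!/usr/bin/env bash
# alert_state CONDITION_TRUE ACTIVE_SECONDS FOR_SECONDS  (hypothetical helper)
#   CONDITION_TRUE  1 if the rule expression currently returns a result, else 0
#   ACTIVE_SECONDS  how long the condition has been continuously true
#   FOR_SECONDS     the rule's "for" duration, in seconds
alert_state() {
    local condition_true="$1" active_seconds="$2" for_seconds="$3"
    if [ "$condition_true" -ne 1 ]; then
        echo "inactive"
    elif [ "$active_seconds" -lt "$for_seconds" ]; then
        echo "pending"
    else
        echo "firing"
    fi
}

alert_state 0 0 60    # inactive: the condition is not met
alert_state 1 30 60   # pending: met for 30s, but "for" is 60s
alert_state 1 90 60   # firing: met for longer than "for"
```

With for: 1m as in the rules above, an alert therefore spends about a minute in pending before a notification is sent.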

Edit the prometheus configuration file

$ vim prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093
rule_files:
   - "rules.yml"

Create a systemd service

$ vim /lib/systemd/system/alertmanager.service

[Unit]
Description=alertmanager
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target

Start alertmanager

$ systemctl daemon-reload
$ systemctl start alertmanager

Verify that it started

The default listening port of alertmanager is 9093.

$ systemctl status alertmanager
$ netstat -lnpt|grep 9093

Reload the prometheus configuration

$ systemctl reload prometheus
or
$ curl -X POST http://localhost:9090/-/reload  (requires the --web.enable-lifecycle option)

View the alerts

Visit http://127.0.0.1:9090/alerts and http://127.0.0.1:9090/rules to see the alerting rules that were added.

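When a monitored metric reaches the configured threshold and stays there for the configured duration, an alert is sent. The corresponding state change is also visible in the web UI.

References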

https://www.hi-linux.com/posts/25047.html#%E5%AE%89%E8%A3%85prometheus

https://www.qikqiak.com/post/prometheus-delete-metrics/

https://www.cnblogs.com/longcnblogs/p/9620733.html
