Application Monitoring and Alerting in Practice

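Preface

A Java application does not have to be excellent, but it must not go without monitoring. Without it, troubleshooting production issues is slow and failures go unreported. Imagine an instance of a core application goes down with no alerting configured. In the best case, someone on your own team notices a few hours later. But if nobody on the team sees it or is paying attention, is the service just going to stay down?

Customer: ** service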

1. Introduction

There are many monitoring solutions on the market. Here we use Prometheus.

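2. Introduction to Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.

Prometheus collects and stores its metrics as time series data: each metric value is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.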
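To make the data model concrete, here is a minimal Python sketch of a time series as described above: a metric name, a set of labels, and timestamped values. The class and field names (Sample, TimeSeries, identity) are illustrative assumptions and not part of any Prometheus API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """One recorded value together with the time it was recorded."""
    timestamp_ms: int
    value: float


@dataclass
class TimeSeries:
    """A metric name plus labels identifies one series; samples accumulate over time."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)

    def identity(self) -> str:
        # Render the identity in the familiar name{label="value"} notation.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{pairs}}}" if pairs else self.name


series = TimeSeries("http_requests_total", {"method": "GET", "code": "200"})
series.samples.append(Sample(timestamp_ms=1700000000000, value=42.0))
print(series.identity())
```

Two series with the same metric name but different label values are distinct series, which is what makes the model multi-dimensional.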

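3. Prometheus features

Prometheus's main features are:

A multi-dimensional data model, with time series identified by metric name and key/value pairs
PromQL, a flexible query language that leverages this dimensionality
No reliance on distributed storage; single server nodes are autonomous
Time series collection via a pull model over HTTP
Pushing time series is supported via an intermediary gateway
Targets are discovered via service discovery or static configuration
Multiple modes of graphing and dashboarding support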

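4. Prometheus components

The Prometheus ecosystem consists of multiple components, some of which are optional:

The main Prometheus server, which scrapes and stores time series data
Client libraries for instrumenting application code
A push gateway for supporting short-lived jobs
Special-purpose exporters for services such as HAProxy, StatsD, and Graphite
Alertmanager, which handles alerts
Various support tools

Most Prometheus components are written in Go, which makes them easy to build and deploy.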

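5. Prometheus architecture

The diagram below shows the overall architecture of Prometheus and the roles of some of its ecosystem components:

[Figure: Prometheus architecture diagram]

It is worth studying this diagram a few times; it gives you an overall picture that makes the later sections easier to follow.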

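6. Where Prometheus fits

Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring and monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.

Prometheus is designed for reliability: it is the system you go to during an outage to quickly diagnose problems. Each Prometheus server is standalone and does not depend on network storage or other remote services. You can rely on it when other parts of your infrastructure are broken, and you do not need to set up extensive infrastructure to use it.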

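7. Environment setup

Download and extract: https://prometheus.io/download/
Configure (for details see https://prometheus.io/docs/prometheus/latest/configuration/configuration/ ; unless you have special requirements, start from the default configuration and modify it)

Prometheus configuration is written in YAML. A sample configuration: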

global:
  scrape_interval:     15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

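The global block controls the Prometheus server's global configuration. Two options are set here. The first, scrape_interval, controls how often Prometheus scrapes targets; you can override it for individual targets. In this example the global setting is to scrape every 15 seconds. The evaluation_interval option controls how often Prometheus evaluates rules. Prometheus uses rules to create new time series and to generate alerts.

The rule_files block specifies the location of any rules we want the Prometheus server to load. For now we have no rules.

The last block, scrape_configs, controls which resources Prometheus monitors. Because Prometheus also exposes data about itself over an HTTP endpoint, it can scrape and monitor its own health. The default configuration has a single job named prometheus, which scrapes the time series data exposed by the Prometheus server. The job contains a single statically configured target: localhost on port 9090. Prometheus expects metrics to be available on targets at the /metrics path, so this default job scrapes the URL http://localhost:9090/metrics.

The time series data returned describes the state and performance of the Prometheus server.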
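To get a feel for what a scrape returns, the following Python sketch fetches a /metrics endpoint and parses the text format into (name, labels, value) tuples. It is a simplified illustration, not how Prometheus itself parses: the function names and regular expressions are assumptions, and optional timestamps and escape sequences in label values are ignored.

```python
import re
from urllib.request import urlopen

# Matches lines such as: metric_name{label="v",other="w"} 12.5
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse metrics text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def scrape(url="http://localhost:9090/metrics"):
    """Fetch one target over HTTP, the way a pull-based scrape would."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

With a Prometheus server running locally, calling scrape() returns the same series that the prometheus job collects every scrape_interval.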

Start Prometheus with the configuration file just created:

./prometheus --config.file=prometheus.yml

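8. Other Prometheus operations

Using the expression browser

Let's look at some of the data Prometheus has collected about itself. To use Prometheus's built-in expression browser, navigate to http://localhost:9090/graph

As you can gather from http://localhost:9090/metrics, one metric that Prometheus exports about itself is named promhttp_metric_handler_requests_total (the total number of /metrics requests the Prometheus server has served).

Go ahead and enter it into the expression input box, then click Execute: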

promhttp_metric_handler_requests_total

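This should return a number of different time series (along with the latest value recorded for each), all with the metric name promhttp_metric_handler_requests_total but with different labels. These labels designate different request statuses.

If we are only interested in requests that resulted in HTTP code 200, we can use this query to retrieve that information: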

promhttp_metric_handler_requests_total{code="200"}

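The result looks like this:

[Screenshot: query result filtered by code="200"]

To count the number of returned time series, you can write: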

count(promhttp_metric_handler_requests_total)

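The result looks like this:

[Screenshot: result of the count() query]

For more about the expression language, see the expression language documentation.

Using the graphing interface

To graph expressions, navigate to http://localhost:9090/graph and use the "Graph" tab.

For example, enter the following expression to graph the per-second rate of HTTP requests returning status code 200 in the self-scraped Prometheus: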

rate(promhttp_metric_handler_requests_total{code="200"}[1m])

You can experiment with the graph range parameters and other settings.

[Screenshot: graph of the request rate]
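The idea behind a per-second rate over a counter can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Prometheus's actual implementation: the real rate() also extrapolates to the edges of the window. The function name and the sample layout are assumptions made for the example.

```python
def simple_rate(samples, window_seconds, now):
    """Approximate per-second increase of a counter over a time window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A counter only goes up; a drop means the process restarted,
        # so the new value is counted as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else None


# A counter scraped every 15 seconds that grows by 15 per scrape.
scraped = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 160)]
print(simple_rate(scraped, window_seconds=60, now=60))
```

The [1m] in the query above plays the role of window_seconds here: it selects the samples the rate is computed from.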

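Monitoring other targets

Collecting metrics from Prometheus alone does not give a good picture of what Prometheus can do. To get a better sense of it, we recommend browsing the documentation for other exporters. The guide to monitoring Linux or macOS host metrics with the node exporter is a good place to start. You can also use other official or third-party exporters: https://prometheus.io/docs/instrumenting/exporters/

Exporters

An exporter is a binary that exposes Prometheus metrics, typically by converting data from a non-Prometheus format into a format Prometheus supports.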
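As a rough illustration of what an exporter does, the Python sketch below converts a plain dictionary of statistics into the Prometheus text format and serves it at /metrics. Everything specific here is an assumption made for the example: the demo_ metric prefix, the read_stats stand-in, and port 9200. Real exporters are usually built on an official client library rather than written by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(stats):
    """Convert a {name: number} dict into Prometheus text format."""
    lines = []
    for name, value in sorted(stats.items()):
        metric = f"demo_{name}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {float(value)}")
    return "\n".join(lines) + "\n"


def read_stats():
    # Hypothetical stand-in for reading a non-Prometheus data source.
    return {"queue_depth": 3, "open_connections": 12}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:9200 as a target in scrape_configs would let Prometheus collect these values like any other job.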

Think that's the end? No~~~

The built-in UI above hardly meets our needs: we can't type every query by hand each time. Next up is Grafana, a proper graphical interface.

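9. Graphical interface: Grafana

9.1 What is Grafana?

Grafana lets you query, visualize, alert on, and understand your metrics no matter where they are stored. Create, explore, and share beautiful dashboards with your team, and foster a data-driven culture.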

9.2 Environment setup

Download: curl -O https://dl.grafana.com/oss/release/grafana-7.1.5.darwin-amd64.tar.gz
Extract
Start: ./bin/grafana-server web
Adjust the configuration (optional; see https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/ )

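9.3 First-time configuration

Once Grafana has started, open localhost:3000 in a browser (3000 is the default port; substitute your own if you changed it).

You will first land on the login page. Enter the default username and password: admin/admin

On first login you are prompted to change the default password; whether you do is up to you.

After logging in you arrive at the home page.

Next, open Configuration.

Then add a data source.

Choose Prometheus as the data source type.

Enter the Prometheus URL, by default http://localhost:9090

When done, click Save & test.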

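9.4 Configuring a dashboard

With the data source in place, we can configure the monitoring views. Templates already exist for common monitoring setups, so there is no need to build each panel by hand. (If the templates don't cover your needs, you will still have to build your own.)

The data has to be exposed first, then scraped by Prometheus, and finally displayed by Grafana.

So there are four steps.

Download (pick the build that matches your system)

First, download node_exporter, which collects server metrics. On the official download page https://prometheus.io/download/ choose your operating system and architecture (here we use darwin amd64), then select node_exporter.

Start

After downloading and extracting, change into the extracted directory and run ./node_exporter to start it.
Configure Prometheus to scrape the data

Add a job with job_name node and targets localhost:9100 (node_exporter listens on port 9100 by default).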

scrape_configs:
 # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
 - job_name: "prometheus"

   # metrics_path defaults to '/metrics'
   # scheme defaults to 'http'.

   static_configs:
     - targets: ["localhost:9090"]
  # the node job below is the new addition
 - job_name: node
   static_configs:
     - targets: ['localhost:9100']

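After changing the configuration, restart Prometheus: ./prometheus --config.file=prometheus.yml

Then simply import a matching template. Templates can be found at https://grafana.com/grafana/dashboards/

Choose Dashboards, then Import.

For server monitoring, use template 8913; after saving you will see the dashboard.

The dashboard looks like this:

[Screenshot: server monitoring dashboard]

Tip: to check the endpoints Prometheus is scraping, open http://localhost:9090/targets (9090 is Prometheus's default port). If every endpoint you configured shows the state UP, everything is working.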
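The same check can be scripted against Prometheus's HTTP API, which exposes target status at /api/v1/targets. The Python sketch below lists every active target whose health is not "up". The function names are illustrative, and the base URL assumes the default local setup used in this article.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, health) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job"), t.get("scrapeUrl"), t.get("health"))
        for t in targets
        if t.get("health") != "up"
    ]


def check(base_url="http://localhost:9090"):
    """Query a running Prometheus server and report unhealthy targets."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return unhealthy_targets(json.load(response))
```

An empty list from check() corresponds to every configured endpoint showing UP on the targets page.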

9.5 Monitoring a Java project

Add the dependencies to the Maven project

<!-- Actuator monitoring -->
<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Prometheus adapter -->
<dependency>
   <groupId>io.micrometer</groupId>
   <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Expose the relevant endpoints

# For convenience, all monitoring endpoints are exposed here. Never do this in production.
management.endpoint.health.show-details=always
management.endpoint.metrics.enabled=true
management.endpoint.prometheus.enabled=true
management.endpoints.web.exposure.include=*
management.metrics.export.prometheus.enabled=true

Configure Prometheus to scrape the data

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    #metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ['localhost:9100'] 
  - job_name: gateway-service
    metrics_path: '/actuator/prometheus'
    static_configs:
      # 9091 is the port the application runs on
      - targets: ['localhost:9091']

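Restart Prometheus
Add JVM and Spring Boot dashboards

In Grafana, add JVM and Spring Boot monitoring dashboards. Here we use template 4701 for JVM monitoring and template 12900 for Spring Boot monitoring.

Here is how each looks:

JVM monitoring

[Screenshot: JVM dashboard]
Spring Boot monitoring

[Screenshot: Spring Boot dashboard]

Template 893 can be used to monitor Docker:

[Screenshot: Docker dashboard]

Is this the end again? Of course not~~~ Next up: alerting.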

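10. Alerting

The alerting tool used here is Alertmanager.

10.1 Introduction to Alertmanager

Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.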

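10.2 The alerting flow

[Figure: alerting flow]

The alerting flow has roughly four steps:

Configure the alert rules and the Alertmanager address in Prometheus
An alert rule is triggered
Prometheus pushes the alert to Alertmanager
Alertmanager receives the alert and delivers it to the configured destination (email, DingTalk, and so on)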

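10.3 Downloading, installing, and configuring Alertmanager

Download Alertmanager from https://prometheus.io/download/ , choosing your operating system and architecture
Extract
Edit the configuration (more detailed documentation: https://prometheus.io/docs/alerting/latest/configuration/)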

route:
  group_by: ['test']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      # this URL points at yet another component, prometheus-webhook-dingtalk
      - url: 'http://localhost:8060/dingtalk/webhook1/send'

Start: ./alertmanager --config.file=alertmanager.yml

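10.4 Configuring alert rules and Prometheus

Configure the alert rules; keeping them in the Prometheus installation directory is recommended
Create alerts.yml in the Prometheus installation directory
with the following content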

# This is the rules file.

groups:
- name: example
  rules:

  - alert: InstanceDown
    expr: up == 0
    for: 3m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."

  - alert: AnotherInstanceDown
    expr: up == 0
    for: 10m
    labels:
        severity: page
    annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 minutes."

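These rules mean: when an application goes down, the first rule waits 3 minutes and, if the instance still has not recovered, sends an alert notification; once the instance recovers, no further alerts are sent. The second rule works the same way but waits 10 minutes.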
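The waiting behaviour of the for clause can be pictured as a small state machine: an alert is pending while its expression has been true for less than the configured duration, firing once it has held that long, and inactive again as soon as the expression is false. The Python sketch below models only that idea; the function name and input layout are assumptions, and it is not Prometheus's actual rule evaluator.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), in time order.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            # The expression no longer matches: the alert resets entirely.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# "up == 0" evaluated once a minute against a rule with "for: 3m".
history = [(0, True), (60, True), (120, True), (180, True), (240, False), (300, True)]
print(alert_states(history, for_seconds=180))
```

In this model a brief outage that recovers within the for duration never leaves the pending state, which is why short blips do not produce notifications.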

Configure Prometheus

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # Alertmanager service; the default port is 9093
        - targets: ['localhost:9093']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alerts.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ['localhost:9100'] 
  - job_name: gateway-service
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:9091']

Start Prometheus

./prometheus --config.file=prometheus.yml

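10.5 prometheus-webhook-dingtalk

What is this for? In short, it helps send DingTalk messages.

The previous flow:

Prometheus -> rule triggers -> Alertmanager -> send

The flow now:

Prometheus -> rule triggers -> Alertmanager -> prometheus-webhook-dingtalk -> send

We could have configured the DingTalk webhook directly in Alertmanager, but then neither the message template nor the signing secret could be configured (perhaps I was doing it wrong).

So an extra forwarding layer was added: alerts go to Alertmanager first, which forwards them to prometheus-webhook-dingtalk, and prometheus-webhook-dingtalk does the actual delivery to DingTalk.

Download: https://github.com/timonwong/prometheus-webhook-dingtalk/releases/tag/v2.1.0
Extract and configure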

## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    # DingTalk robot webhook URL
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx
    # DingTalk secret for signature
    secret: xxxxx

Start

./prometheus-webhook-dingtalk --config.file=config.yml
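For context on what the secret in the configuration above is used for, here is a Python sketch of request signing as I understand DingTalk's scheme for custom robots: an HMAC-SHA256 over the timestamp and the secret, Base64-encoded and URL-encoded, appended to the webhook URL. Treat the details as an assumption to verify against DingTalk's own documentation; the function name is illustrative, and prometheus-webhook-dingtalk performs this step for you when secret is set.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign_webhook_url(url, secret, timestamp_ms=None):
    """Append timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is the millisecond timestamp, a newline, then the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder token and secret, matching the redacted values in the config above.
print(sign_webhook_url("https://oapi.dingtalk.com/robot/send?access_token=xxx", "xxxxx"))
```

Because the timestamp is part of the signature, a signed URL is only meant to be used shortly after it is generated.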

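10.6 Testing

Stop the gateway-service service and wait 3 minutes (this duration is adjustable).

The prometheus-webhook-dingtalk console will then show a log line like the one below, which means Alertmanager's call to the prometheus-webhook-dingtalk webhook succeeded: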

ts=2023-02-09T11:06:21.615Z caller=entry.go:26 level=info component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=[::1]:62467 user_agent=Alertmanager/0.25.0 uri=http://localhost:8060/dingtalk/webhook1/send resp_status=200 resp_bytes_length=2 resp_elapsed_ms=501.124351 msg="request complete"

The alert also arrives in DingTalk:

[Screenshot: alert message received in DingTalk]
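To see what a webhook receiver such as prometheus-webhook-dingtalk has to work with, the Python sketch below turns an Alertmanager webhook payload into a short text message. The payload fields used (status, alerts, labels, annotations) follow Alertmanager's webhook format; the summarize function and the sample payload are illustrative assumptions, and the real tool renders its messages from templates instead.

```python
def summarize(payload):
    """Build a plain-text summary from an Alertmanager webhook payload."""
    alerts = payload.get("alerts", [])
    status = payload.get("status", "unknown").upper()
    lines = [f"[{status}] {len(alerts)} alert(s)"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "?")
        state = alert.get("status", "?")
        lines.append(f"- {name} ({state}): {annotations.get('summary', '')}")
    return "\n".join(lines)


# Hand-written sample resembling what the InstanceDown rule would produce.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "job": "gateway-service"},
            "annotations": {"summary": "Instance localhost:9091 down"},
        }
    ],
}
print(summarize(sample))
```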

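11. Tips

DingTalk message templates offer many more options that you can configure yourself.

Email and other alert channels are also possible; they only require changes to the Alertmanager configuration.

There are many more kinds of alert rules to explore, and ready-made rule templates exist as well.

Prometheus integrations: https://prometheus.io/docs/operating/integrations/

Prometheus website: https://prometheus.io/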


the end good day
