Prometheus Monitoring Study Notes: Getting Started with Prometheus

0x00 Overview

A video walkthrough is available through the NetEase Cloud Classroom course "IT技術快速入門學院", which also links to more articles about Prometheus.

Prometheus is a monitoring and alerting tool that has become popular over the last few years; the rise of Kubernetes in particular has driven its adoption.

Prometheus is a complete monitoring and alerting system. Its main features are:

1. a multi-dimensional data model with time series data identified by metric name and key/value pairs
2. a flexible query language to leverage this dimensionality
3. no reliance on distributed storage; single server nodes are autonomous
4. time series collection happens via a pull model over HTTP
5. pushing time series is supported via an intermediary gateway
6. targets are discovered via service discovery or static configuration
7. multiple modes of graphing and dashboarding support

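InfluxDB, OpenTSDB, and similar tools are dedicated time series databases. On their own they are not complete monitoring and alerting systems, because they lack the alerting part.

Service discovery in Prometheus is very capable: it can find the targets to monitor directly through the APIs of systems such as Kubernetes, with no manual intervention and no integration development.

A Prometheus setup has three parts: prometheus, alertmanager, and one or more *_exporter programs. Each is covered below.

The machine used here has the IP 192.168.88.10.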

0x01 Prometheus

prometheus is the main component. It collects the data and sends alerts to alertmanager, which then delivers them in various forms.

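1. Naming rules

The prometheus data model page describes the data model. A time series is identified by its metric name and may carry any number of labels; each label is a key/value pair.

Metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*, where the colon (:) is reserved for user-defined recording rules.

Label names must match [a-zA-Z_][a-zA-Z0-9_]*; names beginning with __ are reserved for internal labels.

Each data point is called a sample. A sample consists of a float64 value and a timestamp with millisecond precision.

Samples are queried by metric name and labels, using the following syntax: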

<metric name>{<label name>=<label value>, ...}

For example:

api_http_requests_total{method="POST", handler="/messages"}
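The naming rules above can be checked mechanically. The following is a minimal Python sketch that assumes only the two regular expressions quoted above; the helper names are hypothetical and are not part of any Prometheus library.

```python
import re

# Patterns copied from the naming rules above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole string is a legal metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """Return True if the whole string is a legal label name."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names starting with a double underscore are reserved for internal use."""
    return name.startswith("__")
```

Note that a colon is accepted in a metric name but not in a label name, which is the only difference between the two patterns.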

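2. Metric types

There are four metric types: Counter, Gauge, Histogram, and Summary. On the side that produces the metrics, that is, in the application, you must state which type each metric is when you create it with a Prometheus SDK. See: Exporting Prometheus-format metrics with the Prometheus SDK.

A Counter is a cumulative value that can only increase, or be reset to zero on restart.

A Gauge is an instantaneous value that can go up or down.

A Histogram counts observations in buckets and produces several series, with the suffixes _bucket, _sum, and _count. Each _bucket series is a cumulative count of the observations less than or equal to its upper bound le: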

<basename>_bucket{le="<upper inclusive bound>"}

Metrics generated by a histogram named rpc_durations_histogram_seconds:

# TYPE rpc_durations_histogram_seconds histogram
rpc_durations_histogram_seconds_bucket{le="-0.00099"} 0
rpc_durations_histogram_seconds_bucket{le="-0.00089"} 0
rpc_durations_histogram_seconds_bucket{le="-0.0007899999999999999"} 0
rpc_durations_histogram_seconds_bucket{le="-0.0006899999999999999"} 1
rpc_durations_histogram_seconds_bucket{le="-0.0005899999999999998"} 1
rpc_durations_histogram_seconds_bucket{le="-0.0004899999999999998"} 1
rpc_durations_histogram_seconds_bucket{le="-0.0003899999999999998"} 10
rpc_durations_histogram_seconds_bucket{le="-0.0002899999999999998"} 26
rpc_durations_histogram_seconds_bucket{le="-0.0001899999999999998"} 64
rpc_durations_histogram_seconds_bucket{le="-8.999999999999979e-05"} 117
rpc_durations_histogram_seconds_bucket{le="1.0000000000000216e-05"} 184
rpc_durations_histogram_seconds_bucket{le="0.00011000000000000022"} 251
rpc_durations_histogram_seconds_bucket{le="0.00021000000000000023"} 307
rpc_durations_histogram_seconds_bucket{le="0.0003100000000000002"} 335
rpc_durations_histogram_seconds_bucket{le="0.0004100000000000002"} 349
rpc_durations_histogram_seconds_bucket{le="0.0005100000000000003"} 353
rpc_durations_histogram_seconds_bucket{le="0.0006100000000000003"} 356
rpc_durations_histogram_seconds_bucket{le="0.0007100000000000003"} 357
rpc_durations_histogram_seconds_bucket{le="0.0008100000000000004"} 357
rpc_durations_histogram_seconds_bucket{le="0.0009100000000000004"} 357
rpc_durations_histogram_seconds_bucket{le="+Inf"} 357
rpc_durations_histogram_seconds_sum -0.000331219501489902
rpc_durations_histogram_seconds_count 357

A Summary also produces several series, with the suffixes _sum and _count (the _bucket suffix is histogram-only), and it exposes quantiles directly:

<basename>{quantile="<φ>"}

Metrics generated by a summary named rpc_durations_seconds:

# TYPE rpc_durations_seconds summary
rpc_durations_seconds{service="exponential",quantile="0.5"} 7.380919552318622e-07
rpc_durations_seconds{service="exponential",quantile="0.9"} 2.291519677915514e-06
rpc_durations_seconds{service="exponential",quantile="0.99"} 4.539723552933882e-06
rpc_durations_seconds_sum{service="exponential"} 0.0005097984764772547
rpc_durations_seconds_count{service="exponential"} 532

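Both Histogram and Summary can be used to obtain quantiles.

With a Histogram, the bucket series are first collected into prometheus and the quantile is then computed with the query function histogram_quantile(). With a Summary, the application computes the quantiles itself.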
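To make the server-side calculation concrete, here is a rough Python sketch of estimating a quantile from cumulative buckets by linear interpolation inside the bucket that contains the target rank. It is meant to illustrate the idea behind histogram_quantile() for bucketed histograms, not to reproduce the Prometheus implementation; the function name, the restriction to 0 < q <= 1, and the sample bucket values are assumptions made for this example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, like the
    `le` label and value of the _bucket series; the +Inf bucket is required.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the +Inf bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    if index == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0  # assume the first bucket starts at zero
    else:
        lower, below = buckets[index - 1]
    # Interpolate linearly between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries fit the real distribution; a quantile that falls in the +Inf bucket can only be reported as the highest finite bound.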

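Histograms and summaries explains the difference between the two, in particular that the quantiles of a Summary cannot be aggregated.

Note that "cannot be aggregated" does not mean the operation is unsupported; it means that aggregating quantiles usually produces a meaningless result.

LatencyTipOfTheDay: You can’t average percentiles. Period explains in detail why quantiles cannot be summed and averaged: a quantile is a cut point in one data set, and the average of several quantiles does not cut the combined data at the same position.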
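A small numeric illustration of this point, as a Python sketch with made-up latency data: two instances of the same service report their own 90th percentile, and the average of the two values is compared with the 90th percentile of the combined data. The nearest-rank percentile definition and all of the numbers are assumptions chosen for the example.

```python
def percentile(values, p):
    """Nearest-rank percentile; p is an integer percentage from 1 to 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p / 100 * n) in integer arithmetic
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
instance_a = [1] * 90 + [100] * 10  # busy instance: 90 fast requests, 10 slow ones
instance_b = [100] * 10             # quiet instance: every request is slow

p90_a = percentile(instance_a, 90)
p90_b = percentile(instance_b, 90)
average_of_p90s = (p90_a + p90_b) / 2

# The real 90th percentile has to be computed over all requests together.
true_p90 = percentile(instance_a + instance_b, 90)
```

With this data the average of the two per-instance values lands far from the percentile of the combined requests, because the average ignores how many requests each instance served.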

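3. Job and Instance

The concrete thing being monitored is an instance, and the task that monitors a set of instances is a job. Each job handles one kind of task; a job can be configured with multiple instances and performs the same action on each of them.

The instances that belong to a job can be hard-coded in the configuration file, or the job can obtain them dynamically from Consul or Kubernetes. The latter is the service discovery described below.

4. Deployment and startup

prometheus discovers targets according to its configuration file, actively scrapes their metrics, and checks whether any alerting rule is triggered. It is the core of the whole system.

You can use the prebuilt binaries that Prometheus provides: prometheus download.

Download it and give it a quick try: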

# wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz
# tar -xvf prometheus-2.3.2.linux-amd64.tar.gz

After extracting the archive you get the following files:

# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool

If you want to study the source code, you can build it yourself:

go get github.com/prometheus/prometheus
cd $GOPATH/src/github.com/prometheus/prometheus
git checkout <desired version>
make build

Then simply run the prometheus binary:

#  ./prometheus
level=info ts=2018-08-18T12:57:33.232435663Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-08-18T12:57:33.235107465Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
...

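The Prometheus web page is then available at 192.168.88.10:9090.

5. The prometheus configuration file

The key to using prometheus is understanding its configuration file; only a carefully tailored configuration brings out its capabilities.

Somewhat unfortunately, the configuration file is fairly complex, and the official documentation is not especially approachable: prometheus configuration.

The configuration file is in YAML format, structured as follows: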

# $  cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

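Here global holds general global settings; only two parameters are set:

  scrape_interval:     15s      # scrape data every 15s
  evaluation_interval: 15s      # evaluate rules (including alerting rules) every 15s

rule_files lists the alerting rule files to load; alerting rules are covered in a later section.

scrape_configs specifies the targets that prometheus monitors. This is the most complex part.

Each entry in scrape_configs is a job, and jobs come in many kinds. The simplest is static_configs, which lists every target statically, as in the example above:

  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

Targets can also be discovered dynamically through service discovery, for example by using the nodes of a Kubernetes cluster as targets: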

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.88.10
      tls_config:
        ca_file:   /opt/app/k8s/admin/cert/ca/ca.pem
        cert_file: /opt/app/k8s/admin/cert/apiserver-client/cert.pem
        key_file:  /opt/app/k8s/admin/cert/apiserver-client/key.pem
    bearer_token_file: /opt/app/k8s/apiserver/cert/token.csv
    scheme: https
    tls_config:
      ca_file:   /opt/app/k8s/admin/cert/ca/ca.pem
      cert_file: /opt/app/k8s/admin/cert/apiserver-client/cert.pem
      key_file:  /opt/app/k8s/admin/cert/apiserver-client/key.pem

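Start prometheus with this new configuration file:

./prometheus --config.file=./prometheus.k8s.yml

At runtime prometheus automatically detects node changes in Kubernetes and uses the Kubernetes nodes as monitoring targets.

The automatically generated targets are visible on the prometheus page. No screenshot is included here; try it yourself or watch the demo video.

As of 2018-08-10 17:14:05, prometheus has the following service discovery settings (the prefix names the supported system; sd stands for service discovery):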

    azure_sd_config
    consul_sd_config
    dns_sd_config
    ec2_sd_config
    openstack_sd_config
    file_sd_config
    gce_sd_config
    kubernetes_sd_config
    marathon_sd_config
    nerve_sd_config
    serverset_sd_config
    triton_sd_config

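Service discovery is one of the most powerful features of prometheus; combined with relabel_config and *_exporter it can accomplish a great deal.

6. Extending scraping with relabel_config

relabel_config, as the name suggests, rewrites labels. Labels are attached to every metric of every monitored target.

Some labels, however, begin with a double underscore, such as __address__. These are built-in labels with special meaning and are not attached to the scraped metrics.

These labels include: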

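__address__         : address of the target
__scheme__          : http, https, etc.
__metrics_path__    : path from which the metrics are fetched

These three labels are combined into a complete URL. That URL is the monitored target, and the metrics are read from it.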
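The way the three labels combine can be written out in a few lines. This is an illustrative Python sketch only; the function name is hypothetical and the label values are examples.

```python
def scrape_url(labels: dict) -> str:
    """Build the scrape URL from the three special target labels."""
    return "{scheme}://{address}{path}".format(
        scheme=labels["__scheme__"],
        address=labels["__address__"],
        path=labels["__metrics_path__"],
    )


# Example target: Prometheus scraping itself with the default scheme and path.
self_target = {
    "__scheme__": "http",
    "__address__": "localhost:9090",
    "__metrics_path__": "/metrics",
}
url = scrape_url(self_target)
```

Changing any one of the three labels, for instance through relabeling, changes the URL that is scraped.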

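relabel_config rewrites labels, and through label rewriting the URL can be defined very flexibly.

In addition, each service discovery configuration defines built-in labels specific to that service. For example, the node role of kubernetes_sd_config defines: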

__meta_kubernetes_node_name: The name of the node object.
__meta_kubernetes_node_label_<labelname>: Each label from the node object.
__meta_kubernetes_node_annotation_<annotationname>: Each annotation from the node object.
__meta_kubernetes_node_address_<address_type>: The first address for each node address type, if it exists.

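In the previous section, each node's data was scraped directly from the default address http://<NODE IP>/metrics. Here relabeling changes that so the data is fetched through the apiserver instead: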

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.88.10
      tls_config:
        ca_file:   /opt/app/k8s/admin/cert/ca/ca.pem
        cert_file: /opt/app/k8s/admin/cert/apiserver-client/cert.pem
        key_file:  /opt/app/k8s/admin/cert/apiserver-client/key.pem
    bearer_token_file: /opt/app/k8s/apiserver/cert/token.csv
    scheme: https
    tls_config:
      ca_file:   /opt/app/k8s/admin/cert/ca/ca.pem
      cert_file: /opt/app/k8s/admin/cert/apiserver-client/cert.pem
      key_file:  /opt/app/k8s/admin/cert/apiserver-client/key.pem
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: 192.168.88.10
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics

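This is simply the earlier configuration with a relabel_configs section appended.

Reload the configuration file and, after a short while, the target URLs will have changed.

relabel_config is powerful: besides modifying labels, it can also add new labels to the scraped metrics: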

    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      replacement: hello_${1}
      target_label: label_add_by_me

With the above added to the configuration file, every metric gains a label named label_add_by_me.
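The effect of such a rule can be simulated outside Prometheus. The sketch below is a simplified Python model of the default replace action under a few stated assumptions: the source label values are joined with ";", the regex must match the whole joined value, and ${1} refers to the first capture group. Prometheus uses Go's RE2 regex syntax while this sketch uses Python's re module, and the function name is hypothetical.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Return a copy of `labels` with a simplified `replace` rule applied."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex must match the whole value
    if match is None:
        return dict(labels)             # no match: labels stay unchanged
    # Translate ${1} / $1 group references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


discovered = {"__meta_kubernetes_node_name": "node-1"}
with_new_label = relabel_replace(
    discovered, ["__meta_kubernetes_node_name"], "(.+)",
    "label_add_by_me", "hello_${1}")
with_new_path = relabel_replace(
    discovered, ["__meta_kubernetes_node_name"], "(.+)",
    "__metrics_path__", "/api/v1/nodes/${1}/proxy/metrics")
```

The same function covers both uses shown above: writing a new visible label and rewriting the special __metrics_path__ label.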

7. Filtering targets with relabel_config

relabel_config can also filter out targets that are not needed:

  - job_name: "user_server_icmp_detect"
    consul_sd_configs:
    - server: "127.0.0.1:8500"
    scheme: http
    metrics_path: /probe
    params:
      module: [icmp]
    relabel_configs:
    - action: keep
      source_labels: [__meta_consul_tags]        # keep the target if __meta_consul_tags matches the regex
      regex: '.*,icmp,.*'
    - source_labels: [__meta_consul_service]
      regex: '(.+)@(.+)@(.+)'
      replacement: ${2}
      target_label: type
    - source_labels: [__meta_consul_service]
      regex: '(.+)@(.+)@(.+)'
      replacement: ${1}
      target_label: user
    - source_labels: [__address__]
      regex: (.+):(.+)
      replacement: ${1}
      target_label: __param_target
    - target_label: __address__
      replacement:  10.10.199.154:9115
    - source_labels: [__param_target]
      target_label: instance

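8. Prometheus query statements

Query statements are another important topic: besides querying data, they are also used to write the alerting rules covered later.

The simplest query is just the metric name:

go_memstats_other_sys_bytes

Results can be filtered by label:

go_memstats_other_sys_bytes{instance="192.168.88.10"}

Label matchers support four operators: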

=: Select labels that are exactly equal to the provided string.
!=: Select labels that are not equal to the provided string.
=~: Select labels that regex-match the provided string.
!~: Select labels that do not regex-match the provided string.

Regex matches are fully anchored. Several matchers can be combined, separated by ",", and they are ANDed together. The following example is from the Prometheus documentation:

http_requests_total{environment=~"staging|testing|development",method!="GET"}

A selector can even consist of label matchers only:

{instance="192.168.88.10"}
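The semantics of the four operators can be modeled in a few lines of Python. This sketch assumes that a missing label behaves like an empty string and that regex matchers must match the whole label value; it uses Python's re module rather than the RE2 syntax Prometheus uses, and the series data and function names are invented for the example.

```python
import re


def matches(value, op, operand):
    """Evaluate one label matcher against a label value."""
    if op == "=":
        return value == operand
    if op == "!=":
        return value != operand
    if op == "=~":
        return re.fullmatch(operand, value) is not None
    if op == "!~":
        return re.fullmatch(operand, value) is None
    raise ValueError(f"unknown operator: {op}")


def select(series, matchers):
    """Return the series for which every matcher holds (logical AND)."""
    return [s for s in series
            if all(matches(s.get(name, ""), op, operand)
                   for name, op, operand in matchers)]


series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "testing", "method": "GET"},
]
selected = select(series, [
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
])
```

Only the staging POST series survives both matchers, which mirrors the documentation example above.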

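Query results can also be processed further:

# Select a time range: Range Vector Selectors
http_requests_total{job="prometheus"}[5m]

# Time offset
http_requests_total offset 5m

# Sum across all matching series, using the values from 5 minutes ago
sum(http_requests_total{method="GET"} offset 5m)

Operators can be applied as well (see Operators), as can functions (see Functions).

9. Configuring Prometheus alerting rules

Alert rules are defined in separate files and then referenced from prometheus.yml: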

rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

The rules file has the following format:

# $ cat first_rules.yml
groups:
- name: rule1-http_requst_total
  rules:
  - alert:  HTTP_REQUEST_TOTAL
    expr: http_requests_total > 100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: Http request total reach limit
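The for clause means the expression has to stay true for that long before the alert fires; until then the alert is pending. The Python sketch below models this behavior for a single series under simplifying assumptions: evaluations happen at a fixed interval, and the state names inactive, pending, and firing follow the states shown on the alerts page. The function name and the sample values are hypothetical.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `value > threshold`."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # the condition broke, so the timer resets
            states.append("inactive")
    return states


# One evaluation every 15s, threshold 100, for: 1m.
states = alert_states([50, 120, 130, 140, 150, 160, 90],
                      threshold=100, for_seconds=60, interval_seconds=15)
```

A short spike that drops back below the threshold before the for duration elapses never gets past pending, which is the purpose of the clause.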

Note that the alertmanager address must also be configured in prometheus.yml:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

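After reloading the configuration file, the alerting rules appear on the prometheus rules page and any triggered alerts appear on the alerts page.

alertmanager has not been deployed yet; once it is deployed in the next section, the alerts will be visible in alertmanager as well.

0x02 alertmanager

alertmanager receives the alerts sent by prometheus and delivers them through the channels specified in its configuration file.

Centralizing alerts in alertmanager allows finer-grained alert management.

1. Deploying and starting alertmanager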

# wget https://github.com/prometheus/alertmanager/releases/download/v0.15.2/alertmanager-0.15.2.linux-amd64.tar.gz
# tar -xvf alertmanager-0.15.2.linux-amd64.tar.gz

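After extracting the archive you get the following files:

alertmanager  alertmanager.yml  amtool  LICENSE  NOTICE

Running alertmanager directly starts it; the alertmanager page is then available at http://<IP address>:9093/#/alerts.

2. The alertmanager configuration file

The alertmanager configuration file has the following format: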

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

The most important part is receivers, which defines how alerts are handled. Here it is webhook_configs, meaning that alertmanager forwards alerts to the given URL.
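For experimenting with the webhook receiver, something has to listen on that URL. The following is a minimal Python sketch of such a listener using only the standard library. It assumes the webhook body is a JSON document with an alerts list whose entries carry status, labels, and annotations; the function and class names are hypothetical, and the port matches the http://127.0.0.1:5001/ URL in the configuration above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(body: bytes) -> list:
    """Turn a webhook JSON body into one human-readable line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "unknown")
        name = alert.get("labels", {}).get("alertname", "?")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alerts(self.rfile.read(length)):
            print(line)
        self.send_response(200)  # acknowledge receipt
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Block and handle webhook calls until interrupted."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener; each notification is then printed as one line per alert.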

alertmanager configuration lists several ways to handle alerts; webhook_configs is only one of them:

email_config
hipchat_config
pagerduty_config
pushover_config
slack_config
opsgenie_config
victorops_config
webhook_config
wechat_config

3. Configuring email notifications in alertmanager

Here is an example that sends alerts by email, using a NetEase (163.com) mailbox as the sender:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'mail'
  email_configs:
  - to: <address that receives alerts>
    from: <your NetEase sender address>
    smarthost:  smtp.163.com:25
    auth_username: <NetEase mailbox account>
    auth_password: <NetEase mailbox password>
    # auth_secret:
    # auth_identity:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Note that there are two receivers here, web.hook and mail. Which receiver is used is set in the route section above:

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail'

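After reloading the configuration, alert emails will arrive.

4. alertmanager cluster mode

alertmanager can run in cluster mode: several alertmanager instances run together and learn the handling state of each alert from one another over a gossip protocol, which prevents duplicate notifications.

This mode is typically used when prometheus itself must be highly available.

In a highly available deployment (see prometheus ha deploy) there are usually at least two independent prometheus servers, each evaluating its own alerting rules.

Several alertmanager instances are usually deployed alongside them, and these instances need to synchronize state to prevent duplicate notifications.

Because gossip is used, cluster mode is simple to configure: at startup, just pass the address of one or more other alertmanager instances: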

--cluster.peer=192.168.88.10:9094

0x03 *_exporter

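Exporters are a family of programs, each used to collect information from physical machines or middleware. Some are implemented by the Prometheus project itself, and many more by third parties: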

Databases
    Aerospike exporter
    ClickHouse exporter
    Consul exporter (official)
    CouchDB exporter
    ElasticSearch exporter
    EventStore exporter
...
Hardware related
    apcupsd exporter
    Collins exporter
    IoT Edison exporter
...
Messaging systems
    Beanstalkd exporter
    Gearman exporter
    Kafka exporter
...
Storage

    Ceph exporter
    Ceph RADOSGW exporter
...
HTTP

    Apache exporter
    HAProxy exporter (official)
...
APIs
    AWS ECS exporter
    AWS Health exporter
    AWS SQS exporter
Logging

    Fluentd exporter
    Google's mtail log data extractor
...
Other monitoring systems
    Akamai Cloudmonitor exporter
    AWS CloudWatch exporter (official)
    Cloud Foundry Firehose exporter
    Collectd exporter (official)
...
Miscellaneous

    ACT Fibernet Exporter
    Bamboo exporter
    BIG-IP exporter
...

Each exporter collects the metrics of its system and presents them in the Prometheus format for prometheus to scrape.

1. blackbox_exporter

blackbox_exporter is an exporter that probes the reachability and responsiveness of URLs, domains, and so on.

1.1 Deploying and starting blackbox_exporter

# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz
# tar -xvf blackbox_exporter-0.12.0.linux-amd64.tar.gz

After extracting the archive you get:

blackbox_exporter  blackbox.yml  LICENSE  NOTICE

Run it directly; it listens on :9115 by default:

# ./blackbox_exporter

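1.2 blackbox_exporter configuration file and how it works

prometheus/blackbox_exporter is a tool for probing network state over HTTP, HTTPS, DNS, TCP, and ICMP.

Its work is organized into modules that are defined in the blackbox_exporter configuration; see prometheus/blackbox_exporter config.

The configuration file looks like this: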

# $ cat blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp

For example, the configuration below defines two modules, http_2xx and http_post_2xx.

modules:
  http_2xx:
    prober: http
    http:
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'

Modules can be given more parameters and conditions as needed:

http_2xx_example:
  prober: http
  timeout: 5s
  http:
    valid_http_versions: ["HTTP/1.1", "HTTP/2"]
    valid_status_codes: []  # Defaults to 2xx
    method: GET
    headers:
      Host: vhost.example.com
      Accept-Language: en-US
    no_follow_redirects: false
    fail_if_ssl: false
    fail_if_not_ssl: false
    fail_if_matches_regexp:
      - "Could not connect to database"
    fail_if_not_matches_regexp:
      - "Download the latest version here"
    tls_config:
      insecure_skip_verify: false
    preferred_ip_protocol: "ip4" # defaults to "ip6"

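These modules are invoked through the blackbox_exporter service address, with parameters passed in.

For example, to get metrics for the domain www.baidu.com, use the http_2xx module and pass in the target http://www.baidu.com :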

# http://192.168.88.10:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com%2F
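The probe URL is an ordinary query string, so it can be assembled with the standard library. A small Python sketch, where the function name is hypothetical and the exporter address is the one used in this article:

```python
from urllib.parse import urlencode


def probe_url(exporter: str, module: str, target: str) -> str:
    """Build the blackbox_exporter /probe URL for a module and a target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


url = probe_url("192.168.88.10:9115", "http_2xx", "http://www.baidu.com/")
```

urlencode percent-encodes the target, which matters because the target is itself a URL containing ":" and "/".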

blackbox_exporter probes the target URL http://www.baidu.com according to the http_2xx configuration and returns the resulting metrics:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
...<omitted>....
probe_http_version 1.1
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

In this way prometheus can collect metrics for virtual resources such as domains, DNS, and IPs.
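When checking a probe by hand it can help to pull single values out of the returned text. The Python sketch below is a deliberately simple reader for output shaped like the sample above: it skips comment lines and treats the last whitespace-separated field as the value. It assumes there are no trailing timestamps and is not a full parser for the exposition format; the function name is hypothetical.

```python
def parse_samples(text: str) -> dict:
    """Map each sample name (including any labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


probe_output = """\
probe_http_version 1.1
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
"""
samples = parse_samples(probe_output)
probe_ok = samples["probe_success"] == 1
```

probe_success is the series that the alerting rule later in this article compares against 0.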

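relabel_configs can replace __address__ with the blackbox_exporter address, so that a blackbox_exporter URL carrying the chosen parameters becomes the prometheus target.

1.3 Example: monitoring ping reachability of Kubernetes cluster nodes

Configure the icmp module in the blackbox configuration file:

icmp:
    prober: icmp

In prometheus.yml, configure service discovery, rewrite __address__ to the blackbox_exporter address, and attach the relevant parameters: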

  - job_name: 'kubernetes-nodes-ping'
    kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.88.10
      tls_config:
        ca_file:   /opt/app/k8s/admin/cert/ca/ca.pem
        cert_file: /opt/app/k8s/admin/cert/apiserver-client/cert.pem
        key_file:  /opt/app/k8s/admin/cert/apiserver-client/key.pem
    bearer_token_file: /opt/app/k8s/apiserver/cert/token.csv
    scheme: http
    metrics_path: /probe
    params:
      module: [icmp]
    relabel_configs:
    - source_labels: [__address__]
      regex: (.+):(.+)
      replacement: ${1}
      target_label: __param_target
    - target_label: __address__
      replacement: 192.168.88.10:9115
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

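After reloading the configuration, the new targets appear on the prometheus page, and their addresses are the blackbox address.

You can search for the probe_success metric in prometheus:

# http://10.10.199.154:9090/graph?g0.range_input=1h&g0.expr=probe_success&g0.tab=0

The following alerting rule fires if a node cannot be pinged for 2 consecutive minutes: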

- name: node_icmp_avaliable
  rules:
  - alert: NODE_ICMP_UNREACHABLE
    expr: probe_success{job="kubernetes-nodes-ping"} == 0
    for: 2m
    labels:
      level: 1
    annotations:
      summary: node is {{ $labels.instance }}

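0x03 When labels can be changed: before scraping, after scraping, and when alerting

prometheus supports modifying labels. A metric's labels can be set directly by the collecting side at collection time; these are the original labels.

Beyond that, the labels of a metric can also be modified in the prometheus configuration file.

There are two points at which this can happen: before scraping, via relabel_configs; and after scraping but before the data is written to storage, via metric_relabel_configs.

Both are configured in the same way: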

relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
  regex: 'rabbitmq01-exporter'
  replacement: 'public-rabbitmq01.paas.production:5672'
  target_label: instance

metric_relabel_configs:
- source_labels: [node]
  regex: 'rabbit01@rabbit01'
  replacement: 'public-rabbitmq01.paas.production:5672'
  target_label: node_addr

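The first generates a new label, instance, before scraping, from labels that already exist; labels present before scraping are usually set by service discovery.

The second runs after scraping: it inspects the scraped metrics and, if the label node matches the regex, generates a new label, node_addr.

To modify an existing label, set target_label to the name of that label.

In addition, alert_relabel_configs can modify labels before alerts are sent.

Reference: relabel_configs vs metric_relabel_configs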

0x04 Miscellaneous

Below are some materials looked up while studying, listed as-is without further organization.

Rule checking:

# promtool check rules /etc/prometheus/alert-rules.yml
# ./promtool check rules alert_rule_test.yml

Monitoring CPU:

# https://stackoverflow.com/questions/49083348/cadvisor-prometheus-integration-returns-container-cpu-load-average-10s-as-0

# In order to get the metric “container_cpu_load_average_10s” the cAdvisor must run with the option “--enable_load_reader=true”

Set the kubelet parameter: --enable-load-reader

container_spec_cpu_quota

cAdvisor metrics

alertmanager-webhook-receiver

Write a bash shell script that consumes a constant amount of RAM for a user defined time

Cadvisor metric “container_network_tcp_usage_total” always “0”

Common cAdvisor container monitoring metrics

prometheus-book

Relabeling is a powerful tool to dynamically rewrite the label set of a target before it gets scraped.

Operators

# Container CPU load alerting
# container_cpu_load_average_10s, container_spec_cpu_quota, container_spec_cpu_shares, container_spec_cpu_quota
# Container CPU limit: container_spec_cpu_quota / container_spec_cpu_period
# CPU usage per namespace: sum(rate(container_cpu_usage_seconds_total{namespace=~".+"}[1m])) by (namespace) * 100
# CPU usage per container: sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100
# rate(container_cpu_usage_seconds_total{name=~".+"}[1m])

Computing container memory usage:

# container_memory_usage_bytes{container_name!="", pod_name!=""} / container_spec_memory_limit_bytes{container_name!="", pod_name!=""}

# container_memory_usage_bytes{instance="prod-k8s-node-155-171",container_name!="", pod_name!=""} / container_spec_memory_limit_bytes{instance="prod-k8s-node-155-171",container_name!="", pod_name!=""}

# container_memory_usage_bytes{container_name!="", pod_name!=""} / container_spec_memory_limit_bytes{container_name!="", pod_name!=""} > 0.98

# container_memory_rss{container_name!="", pod_name!=""}/container_spec_memory_limit_bytes{container_name!="", pod_name!=""} >0.98

0x05 References

prometheus documents
prometheus configuration
prometheus download
prometheus first_steps
prometheus relabel_config
prometheus exporters
prometheus/blackbox_exporter
prometheus/blackbox_exporter config
Prometheus Remote Storage use case: a monitoring solution for multiple Kubernetes clusters
Operators
Functions
alerting rules
alertmanager configuration
prometheus ha deploy
prometheus exporter
prometheus data model
prometheus metric types
prometheus Histograms and summaries
LatencyTipOfTheDay: You can’t average percentiles. Period.
relabel_configs vs metric_relabel_configs