Nightingale Notes: Monitoring Linux Hosts

The previous posts looked at ways to install Nightingale; this one moves on to actually using it.

Main text

Environment used in this article

  • Nightingale v5.3
  • node_exporter 1.3.1
  • telegraf 1.21.3
  • CentOS 7.9

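node_exporter section

node_exporter is the official Prometheus collector for host metrics, and it is very simple to install.

Download the node_exporter package

Connections to GitHub from mainland China are sometimes reset, so the download below uses a mirror hosted by Shanghai Jiao Tong University (the sjtu.edu.cn domain in the URL).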

wget https://s3.jcloud.sjtu.edu.cn/899a892efef34b1b944a19981040f55b-oss01/github-release/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz           

Extract the node_exporter archive

Extracting the archive leaves you with a single binary file.

mkdir /opt/node_exporter
mv node_exporter-1.3.1.linux-amd64.tar.gz /opt/node_exporter
cd /opt/node_exporter/
tar xzvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64/           

Run node_exporter

If the output contains the words Listening on, node_exporter is running normally.

./node_exporter            

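Prometheus configuration

Locate prometheus.yml. Its path differs from one environment to another, so only the configuration itself is shown here. Add the job below under scrape_configs, and pay close attention to the indentation, because YAML is sensitive to it.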

scrape_configs:
  - job_name: "local"
    static_configs:
      - targets: ["10.240.99.198:9100"]
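
The format caveat is mostly about indentation: each job is a list item under scrape_configs, and static_configs and targets are each nested one level deeper. As a small sketch (Python is used here purely for illustration; the helper name scrape_job_snippet is made up and is not part of Prometheus), a job entry with consistent indentation can be generated like this:

```python
def scrape_job_snippet(job_name, targets, indent=2):
    """Render one job entry for the scrape_configs list in prometheus.yml."""
    pad = " " * indent
    # Quote every target so that host:port pairs stay plain strings in YAML.
    quoted = ", ".join(f'"{t}"' for t in targets)
    lines = [
        "scrape_configs:",
        f'{pad}- job_name: "{job_name}"',
        f"{pad}  static_configs:",
        f"{pad}    - targets: [{quoted}]",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # The address is the one used in this article; replace it with your own host.
    print(scrape_job_snippet("local", ["10.240.99.198:9100"]))
```

If prometheus.yml already has a scrape_configs key, paste only the job entry lines under it rather than repeating the key.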

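Hot-reload the Prometheus configuration

The reload endpoint responds only if Prometheus was started with the --web.enable-lifecycle flag.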

curl -X POST http://127.0.0.1:9090/-/reload           

Run node_exporter as a systemd service

mkdir /usr/local/node_exporter
mv /opt/node_exporter/node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/node_exporter/           

Save the following unit file as, for example, /etc/systemd/system/node_exporter.service:

[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target           

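Start node_exporter

If the copy started by hand earlier is still running in the foreground, stop it first (Ctrl+C); otherwise port 9100 is already in use and the service fails to start.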

systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter           

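Note that hosts monitored only through node_exporter do not appear in Nightingale's object list; their data can only be viewed through instant query. For a host to show up in the object list, it has to be monitored with Telegraf.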

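Telegraf section

Telegraf is an all-in-one agent: a single binary covers collection from hosts, network devices, middleware, databases, StatsD and more. Compared with a scattering of separate exporters, it is cheaper to maintain. Telegraf connects to Nightingale through its OpenTSDB output plugin.

Download the Telegraf RPM package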

wget https://mirrors.nju.edu.cn/influxdata/yum/el8-x86_64/telegraf-1.21.3-1.x86_64.rpm           

Install Telegraf

yum localinstall telegraf-1.21.3-1.x86_64.rpm -y           

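Modify the Telegraf configuration

Clear the existing configuration (for the RPM package this is normally /etc/telegraf/telegraf.conf) and paste in the configuration below. The values to change are host and port under [[outputs.opentsdb]]; set them to the address and port of your own Nightingale server.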

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.opentsdb]]
  host = "http://10.240.99.198"
  port = 19000
  http_batch_size = 50
  http_path = "/opentsdb/put"
  debug = false
  separator = "_"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.system]]
  fielddrop = ["uptime_format"]

[[inputs.net]]
  ignore_protocol_stats = true           
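
To make the separator setting more concrete: as far as I understand Telegraf's OpenTSDB output, each field of a measurement becomes its own metric, named measurement + separator + field. That is why the dashboard and alert rules later in this article query names such as cpu_usage_idle and mem_available_percent. The Python sketch below imitates that flattening. The function name, the host name demo-host and the sample values are made up for illustration, and the exact payload Telegraf sends may differ in detail.

```python
import json


def opentsdb_points(measurement, fields, tags, timestamp, separator="_"):
    """Flatten one Telegraf-style metric into OpenTSDB-style put points."""
    points = []
    for field_name in sorted(fields):
        points.append({
            # e.g. "cpu" + "_" + "usage_idle" -> "cpu_usage_idle"
            "metric": f"{measurement}{separator}{field_name}",
            "timestamp": timestamp,
            "value": fields[field_name],
            "tags": dict(tags),
        })
    return points


if __name__ == "__main__":
    # Hypothetical sample values, not real measurements.
    sample = opentsdb_points(
        "cpu",
        {"usage_idle": 97.5, "usage_active": 2.5},
        {"host": "demo-host", "cpu": "cpu-total"},
        timestamp=1650000000,
    )
    print(json.dumps(sample, indent=2))
```

With separator = "_" this yields the metrics cpu_usage_active and cpu_usage_idle, each carrying the same tags. A different separator would change every metric name, and the dashboard and alert rules in this article would then no longer match.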

Restart Telegraf

service telegraf restart
systemctl enable telegraf           

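Check the Nightingale web UI

The host where Telegraf was just started now appears under ungrouped objects, and its metrics can be viewed under Monitoring Charts -> Object View.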

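Import the official dashboard

Open the dashboards page, click Import, and paste in the JSON below. Note that the last chart queries netstat_tcp_time_wait, which comes from Telegraf's netstat input; it stays empty unless [[inputs.netstat]] is added to the Telegraf configuration.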

[
  {
    "name": "Linux basic monitoring metrics (collected by Telegraf)",
    "tags": "HOST",
    "configs": "{\"var\":[{\"name\":\"host\",\"definition\":\"label_values(mem_used_percent, ident)\"}]}",
    "chart_groups": [
      {
        "name": "Default chart group",
        "weight": 0,
        "charts": [
          {
            "configs": "{\"name\":\"Overall CPU idle (%)\",\"QL\":[{\"PromQL\":\"cpu_usage_idle{cpu=\\\"cpu-total\\\", ident=\\\"$host\\\"}\"}],\"yplotline1\":35,\"yplotline2\":15,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"asc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":0,\"y\":0,\"i\":\"0\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"Memory available (%)\",\"QL\":[{\"PromQL\":\"mem_available_percent{ident=\\\"$host\\\"}\"}],\"yplotline1\":30,\"yplotline2\":15,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"asc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":8,\"y\":0,\"i\":\"1\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"Disk usage (%)\",\"QL\":[{\"PromQL\":\"disk_used_percent{ident=\\\"$host\\\"}\"}],\"yplotline1\":87,\"yplotline2\":92,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":16,\"y\":0,\"i\":\"2\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"IO.UTIL(%)\",\"QL\":[{\"PromQL\":\"rate(diskio_io_time{ident=\\\"$host\\\"}[1m])/10\"}],\"yplotline1\":90,\"yplotline2\":null,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"origin\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":0,\"y\":2,\"i\":\"3\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"NIC packets dropped per minute\",\"QL\":[{\"PromQL\":\"increase(net_drop_in{ident=\\\"$host\\\"}[1m])\",\"Legend\":\"net_drop_in ident:{{ident}} interface:{{interface}}\"},{\"PromQL\":\"increase(net_drop_out{ident=\\\"$host\\\"}[1m])\",\"Legend\":\"net_drop_out ident:{{ident}} interface:{{interface}}\"}],\"yplotline1\":5,\"yplotline2\":20,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"short\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":8,\"y\":2,\"i\":\"4\"}}",
            "weight": 0
          },
          {
            "configs": "{\"name\":\"TCP_TIME_WAIT count\",\"QL\":[{\"PromQL\":\"netstat_tcp_time_wait{ident=\\\"$host\\\"}\"}],\"yplotline1\":null,\"yplotline2\":20000,\"legend\":false,\"highLevelConfig\":{\"shared\":true,\"sharedSortDirection\":\"desc\",\"precision\":\"short\",\"formatUnit\":1000},\"version\":1,\"layout\":{\"h\":2,\"w\":8,\"x\":16,\"y\":2,\"i\":\"5\"}}",
            "weight": 0
          }
        ]
      }
    ]
  }
]           
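
One detail of this export format is easy to miss: every configs value is itself a JSON document stored as a string, which is why its quotes are escaped. Below is a rough Python sketch for listing the PromQL behind each chart before importing. The function name chart_queries and the trimmed sample export are my own and not part of Nightingale.

```python
import json


def chart_queries(dashboards):
    """Return (chart name, PromQL) pairs from an exported dashboard list."""
    pairs = []
    for board in dashboards:
        for group in board.get("chart_groups", []):
            for chart in group.get("charts", []):
                # "configs" holds JSON encoded as a string, so decode it again.
                cfg = json.loads(chart["configs"])
                for query in cfg.get("QL", []):
                    pairs.append((cfg["name"], query["PromQL"]))
    return pairs


if __name__ == "__main__":
    # A trimmed, hypothetical export containing a single chart.
    inner = {
        "name": "IO.UTIL(%)",
        "QL": [{"PromQL": 'rate(diskio_io_time{ident="$host"}[1m])/10'}],
    }
    export = [{
        "name": "demo",
        "chart_groups": [
            {"charts": [{"configs": json.dumps(inner), "weight": 0}]}
        ],
    }]
    for name, promql in chart_queries(export):
        print(f"{name}: {promql}")
```

To check the real export, save the JSON above to a file, load it with json.load and pass the result to chart_queries.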

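Appendix

Common alert rules for Linux

Several of these rules rely on metrics from Telegraf input plugins that are not in the configuration above (ping, net_response, procstat and netstat); they only take effect once those plugins are enabled.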

[
  {
    "name": "An address cannot be pinged, please check",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "ping_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "A monitored object has lost contact",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "target_up != 1",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "A port probe failed, please check",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "net_response_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Host load - CPU usage is high, please check",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "cpu_usage_idle{cpu=\"cpu-total\"} < 25",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Host load - memory usage is high, please check",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "mem_available_percent < 25",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Disk - IO is very busy",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "rate(diskio_io_time[1m])/10 > 99",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Disk - expected to fill up within 4 hours",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "predict_linear(disk_free[1h], 4*3600) < 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "NIC - inbound packets dropped",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "increase(net_drop_in[1m]) > 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "NIC - outbound packets dropped",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "increase(net_drop_out[1m]) > 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Network connections - TIME_WAIT count exceeds 20,000",
    "note": "",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "netstat_tcp_time_wait > 20000",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Process monitoring - process count is 0, a process may be down",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_lookup_running == 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Process monitoring - file descriptor limit is too low",
    "note": "",
    "severity": 3,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_rlimit_num_fds_soft < 2048",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "Process monitoring - collection failed",
    "note": "",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 60,
    "prom_ql": "procstat_lookup_result_code != 0",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "notify_recovered": 1,
    "notify_channels": [
      "email",
      "dingtalk",
      "wecom"
    ],
    "notify_repeat_step": 60,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  }
]           
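
All of the rules above share the same settings and differ only in name, severity and prom_ql. If you want to add rules of your own in the same style, a small Python helper along these lines may save some typing. It is only a sketch that copies the default values from the rules above; the helper name and the two sample rule names are made up.

```python
import json


def make_rule(name, prom_ql, severity):
    """Build one alert rule using the defaults shared by the rules above."""
    return {
        "name": name,
        "note": "",
        "severity": severity,
        "disabled": 0,
        "prom_for_duration": 60,
        "prom_ql": prom_ql,
        "prom_eval_interval": 15,
        "enable_stime": "00:00",
        "enable_etime": "23:59",
        # Same day list as in the rules above.
        "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
        "notify_recovered": 1,
        "notify_channels": ["email", "dingtalk", "wecom"],
        "notify_repeat_step": 60,
        "callbacks": [],
        "runbook_url": "",
        "append_tags": [],
    }


if __name__ == "__main__":
    # Illustrative names; the expressions are taken from the rules above.
    rules = [
        make_rule("Memory usage is high", "mem_available_percent < 25", 2),
        make_rule("Disk IO is very busy", "rate(diskio_io_time[1m])/10 > 99", 2),
    ]
    print(json.dumps(rules, indent=2))
```

The printed JSON has the same shape as the list above, so it should be importable in the same way, although I have only checked the structure and not every Nightingale version.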

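Closing remarks

That covers the basics. Two conclusions stand out. With an exporter as the collector, Nightingale only plays a Grafana-like role, that is, querying. With Telegraf as the collector, it works as a full monitoring application. Later posts will focus on Telegraf plugins.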