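Most companies now store their application, middleware, and system logs in Elasticsearch, so detecting anomalies in those logs and alerting on them promptly is essential. This article introduces two mainstream log monitoring approaches: ElastAlert, open-sourced by Yelp, and Watcher, a commercial feature from Elastic.
As shown in the figure below, the log source is an Nginx server. Filebeat is installed on that server to collect the Nginx logs and ship them to Elasticsearch. We then demonstrate monitoring and alerting on those logs with both ElastAlert and Watcher.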

Deploying Nginx
Install dependencies
yum install -y gcc gcc-c++ autoconf pcre pcre-devel make automake wget httpd-tools vim tree zlib-devel
Download the source package
wget http://nginx.org/download/nginx-1.14.0.tar.gz
tar -xzvf nginx-1.14.0.tar.gz
Compile and install
cd nginx-1.14.0
./configure
make && make install
Configure Nginx
Edit the configuration file /usr/local/nginx/conf/nginx.conf to serve a static site from Nginx.
worker_processes 1;
events {
    worker_connections 1024;
}
http {
    server {
        listen 80;
        location / {
            root html;
        }
    }
}
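Start Nginx (run from the installation directory /usr/local/nginx):
sbin/nginx
Access Nginx in a browser to confirm that the static page is served.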
Deploying Filebeat
Download and install Filebeat.
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.14.0-x86_64.rpm
sudo rpm -vi filebeat-7.14.0-x86_64.rpm
Edit the /etc/filebeat/filebeat.yml configuration file so that Filebeat reads the Nginx log files and writes them to an Elasticsearch index named nginx, suffixed with the current date.
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /usr/local/nginx/logs/*.log
output.elasticsearch:
  hosts: ["192.168.1.8:9200"]
  index: "nginx-%{+yyyy.MM.dd}"
  #username: "elastic"
  #password: "changeme"
setup.ilm.enabled: false
setup.template.name: "nginx"
setup.template.pattern: "nginx-*"
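The index setting above appends a date to the index name, so each day's events land in their own index. As a rough illustration only, the resulting name can be reproduced in Python. The helper name daily_index_name is hypothetical and not part of Filebeat, and the sketch assumes the date is taken from the event timestamp in UTC.

```python
from datetime import datetime, timezone


def daily_index_name(prefix: str, when: datetime) -> str:
    """Mimic the "nginx-%{+yyyy.MM.dd}" pattern for a given event time."""
    # Assumption: the date part is derived from the event timestamp in UTC.
    # yyyy.MM.dd corresponds to %Y.%m.%d in strftime notation.
    return f"{prefix}-{when.astimezone(timezone.utc).strftime('%Y.%m.%d')}"


# Timestamp chosen to match the sample event shown later in this article.
event_time = datetime(2021, 8, 16, 7, 34, 26, tzinfo=timezone.utc)
index_name = daily_index_name("nginx", event_time)
print(index_name)
```

Because every daily index shares the nginx- prefix, a wildcard such as nginx-* covers all of them in a single query.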
Start Filebeat:
systemctl start filebeat
ElastAlert
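ElastAlert is an Elasticsearch alerting framework written in Python and open-sourced by Yelp. It queries Elasticsearch for data that matches a set of rules and raises alerts on the matches.
ElastAlert has the following features:
- Supports multiple rule types (frequency, threshold, data change, blacklist/whitelist, rate of change, and so on).
- Supports multiple alert types (email, HTTP POST, custom scripts, and so on).
- Supports user-defined rules and alert types.
- Aggregates matches into combined alerts, suppresses repeated alerts, and retries failed alerts until they expire.
- Highly available: state information is stored in an Elasticsearch index.
- Supports debugging and auditing.
Deploying ElastAlert
Install Python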
wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
tar -zxvf Python-3.6.9.tgz
cd Python-3.6.9
./configure
make && make install
Check the Python version:
python3 -V
Install the build dependencies, then upgrade pip and setuptools:
yum install gcc libffi-devel python3-devel openssl-devel -y
pip3 install -U pip
pip3 install "setuptools>=11.3"
Install ElastAlert
pip3 install elastalert
Configuring ElastAlert
Clone the code locally:
git clone https://github.com/Yelp/elastalert.git
cd elastalert
In the root directory of the ElastAlert source tree there is a file named config.yaml.example. Rename it to config.yaml:
mv config.yaml.example config.yaml
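Create a directory to hold the rules. Stay in the repository root afterwards, because the later commands refer to rules/nginx.yaml relative to it.
mkdir rules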
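Edit config.yaml to adjust the main configuration:
# Directory that holds the rule files
rules_folder: rules
# How often ElastAlert queries Elasticsearch
run_every:
  minutes: 1
# ElastAlert buffers results from the most recent period in case some log sources are not real-time
buffer_time:
  minutes: 45
# Elasticsearch host
es_host: 192.168.1.8
# Elasticsearch port
es_port: 9200
# Elasticsearch username and password (optional)
#es_username: someusername
#es_password: somepassword
# Index in which ElastAlert stores its metadata
writeback_index: elastalert_status
# If an alert fails for some reason, ElastAlert retries sending it until this period has elapsed
alert_time_limit:
  days: 2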
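The run_every, buffer_time, and alert_time_limit settings are each written as a unit and a number. The following is a minimal sketch of how such blocks can be read as durations, assuming each block maps directly onto keyword arguments of Python's timedelta. The dictionaries stand in for what a YAML parser would return, and to_timedelta is a hypothetical helper name.

```python
from datetime import timedelta

# Values copied from config.yaml, written as the dicts a YAML parser would yield.
config = {
    "run_every": {"minutes": 1},
    "buffer_time": {"minutes": 45},
    "alert_time_limit": {"days": 2},
}


def to_timedelta(duration: dict) -> timedelta:
    """Turn a mapping such as {"minutes": 45} into a timedelta."""
    return timedelta(**duration)


durations = {key: to_timedelta(value) for key, value in config.items()}
for key, value in durations.items():
    print(f"{key}: {value.total_seconds():.0f} seconds")
```

With these values, a query runs every 60 seconds, each query looks back over a 45-minute buffer, and a failed alert is retried for up to 2 days.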
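Create the file rules/nginx.yaml and define the rule:
The rule is: if a query against the nginx-* indices matches error in the message field 5 times within 1 minute, trigger an alert that sends an HTTP POST request to the specified URL.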
# Alert when the rate of events exceeds a threshold
# (Required)
# Elasticsearch host
es_host: 192.168.1.8
# (Required)
# Elasticsearch port
es_port: 9200
# (Optional) Connect with SSL to elasticsearch
#use_ssl: True
# (Optional) basic-auth username and password for elasticsearch
#es_username: someusername
#es_password: somepassword
# (Required)
# Rule name, must be unique
name: nginx rule
# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur within timeframe time
type: frequency
# (Required)
# Index to search, wildcard supported
index: nginx-*
# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 5
# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 1
# (Required)
# A list of elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- term:
    message: "error"
# (Required)
# The alert is used when a match is found
alert:
- "post"
http_post_url: "https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36"
https://webhook.site provides a test webhook endpoint. Each visitor gets a unique URL; copy that URL into http_post_url.
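The frequency rule defined above fires when num_events matching documents fall within one timeframe. The sketch below models that idea with a sliding window. It is a simplified illustration, not ElastAlert's actual implementation, and the function name frequency_matches is hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def frequency_matches(timestamps, num_events, timeframe):
    """Return the timestamps at which num_events events fall within timeframe."""
    window = deque()
    matches = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events older than the timeframe relative to the newest event.
        while ts - window[0] > timeframe:
            window.popleft()
        if len(window) >= num_events:
            matches.append(ts)
            window.clear()  # start counting afresh after a match
    return matches


start = datetime(2021, 8, 16, 7, 34, 0)
burst = [start + timedelta(seconds=5 * i) for i in range(5)]   # 5 errors in 20 seconds
sparse = [start + timedelta(minutes=2 * i) for i in range(5)]  # 5 errors over 8 minutes

burst_matches = frequency_matches(burst, num_events=5, timeframe=timedelta(minutes=1))
sparse_matches = frequency_matches(sparse, num_events=5, timeframe=timedelta(minutes=1))
print(burst_matches, sparse_matches)
```

Five errors in quick succession produce one match, while the same number of errors spread over several minutes produces none.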
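ElastAlert writes its execution records to an index, which makes auditing and debugging easier. Use the following command to create that index; by default it is named elastalert_status.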
root@ydt-net-es-node1:/software #elastalert-create-index
Enter Elasticsearch host: 192.168.1.8
Enter Elasticsearch port: 9200
Use SSL? t/f: f
# If authentication is enabled, enter the username and password
Enter optional basic-auth username (or leave blank):
Enter optional basic-auth password (or leave blank):
Enter optional Elasticsearch URL prefix (prepends a string to the URL of every request):
New index name? (Default elastalert_status)
New alias name? (Default elastalert_alerts)
Name of existing index to copy? (Default None)
Elastic Version: 7.9.3
Reading Elastic 6 index mappings:
Reading index mapping 'es_mappings/6/silence.json'
Reading index mapping 'es_mappings/6/elastalert_status.json'
Reading index mapping 'es_mappings/6/elastalert.json'
Reading index mapping 'es_mappings/6/past_elastalert.json'
Reading index mapping 'es_mappings/6/elastalert_error.json'
New index elastalert_status created
Done!
Send 2 requests: 1 valid request and 1 invalid request.
> curl http://192.168.1.134 -I
HTTP/1.1 200 OK
Server: nginx/1.14.2
Date: Mon, 16 Aug 2021 07:28:42 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Wed, 16 Jun 2021 02:46:13 GMT
Connection: keep-alive
ETag: "60c965f5-264"
Accept-Ranges: bytes
> curl http://192.168.1.134/xxxxxx -I
HTTP/1.1 404 Not Found
Server: nginx/1.14.2
Date: Mon, 16 Aug 2021 07:28:43 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive
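The Nginx logs are now visible in Kibana. The invalid request is written once to access.log and once to error.log, so 3 records appear here.
Run the elastalert-test-rule command to check that the configuration file is valid and to see how many times the rule matches. The elastalert-test-rule command does not actually send any alerts.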
> elastalert-test-rule rules/nginx.yaml
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
To send them but remain verbose, use --verbose instead.
Didn't get any results.
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
To send them but remain verbose, use --verbose instead.
1 rules loaded
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
# One hit
INFO:elastalert:Queried rule nginx rule from 2021-08-16 15:28 CST to 2021-08-16 15:29 CST: 1 / 1 hits
Would have written the following documents to writeback index (default is elastalert_status):
elastalert_status - {'rule_name': 'nginx rule', 'endtime': datetime.datetime(2021, 8, 16, 7, 29, 30, 422431, tzinfo=tzutc()), 'starttime': datetime.datetime(2021, 8, 16, 7, 28, 29, 822431, tzinfo=tzutc()), 'matches': 0, 'hits': 1, '@timestamp': datetime.datetime(2021, 8, 16, 7, 29, 30, 527080, tzinfo=tzutc()), 'time_taken': 0.02203655242919922}
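The output above reports a single hit even though three log records were indexed, because only the error.log line contains the term error. The sketch below illustrates this with a rough stand-in for the text analysis Elasticsearch applies to the message field (lowercasing and splitting on non-alphanumeric characters). The access.log lines are hypothetical examples, since their exact format depends on the Nginx configuration.

```python
import re

# Hypothetical log lines for one successful and one failing request.
messages = [
    '192.168.1.35 - - [16/Aug/2021:15:28:42 +0800] "HEAD / HTTP/1.1" 200 0 "-" "curl/7.29.0"',
    '192.168.1.35 - - [16/Aug/2021:15:28:43 +0800] "HEAD /xxxxxx HTTP/1.1" 404 0 "-" "curl/7.29.0"',
    '2021/08/16 15:28:43 [error] 4022#0: *1 open() "/usr/local/nginx/html/xxxxxx" failed '
    '(2: No such file or directory)',
]


def tokens(text: str) -> set:
    """Approximate the analyzer: lowercase, then split on non-alphanumerics."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# A term filter on "error" matches only documents whose tokens include "error".
hits = [m for m in messages if "error" in tokens(m)]
print(len(hits))
```

Only the error.log entry counts toward num_events, so each failing request contributes one hit rather than two.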
Send 5 invalid requests within 1 minute to reach the alert threshold:
for i in {1..5}; do curl http://192.168.1.134/xxxxxx -I; done
The output now shows the alert that would be sent.
> elastalert-test-rule rules/nginx.yaml
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
To send them but remain verbose, use --verbose instead.
Didn't get any results.
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
To send them but remain verbose, use --verbose instead.
1 rules loaded
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:elastalert:Queried rule nginx rule from 2021-08-16 15:33 CST to 2021-08-16 15:34 CST: 5 / 5 hits
INFO:elastalert:Alert for nginx rule at 2021-08-16T07:34:26.230Z:
INFO:elastalert:nginx rule
At least 5 events occurred between 2021-08-16 15:33 CST and 2021-08-16 15:34 CST
@timestamp: 2021-08-16T07:34:26.230Z
_id: 0CDiTXsBCANUjLffFM2O
_index: nginx-2021.08.16
_type: _doc
agent: {
"ephemeral_id": "4ee4bd89-cb8e-43fb-9331-476c229a5480",
"hostname": "nginx-plus1",
"id": "629442a8-34ab-40db-80a8-16e4fda8dec7",
"name": "nginx-plus1",
"type": "filebeat",
"version": "7.14.0"
}
ecs: {
"version": "1.10.0"
}
host: {
"name": "nginx-plus1"
}
input: {
"type": "log"
}
log: {
"file": {
"path": "/usr/local/nginx/logs/error.log"
},
"offset": 16944
}
message: 2021/08/16 15:34:22 [error] 4022#0: *40 open() "/usr/local/nginx/html/xxxxxx" failed (2: No such file or directory), client: 192.168.1.35, server: , request: "GET /xxxxxx HTTP/1.1", host: "192.168.1.134"
num_hits: 5
num_matches: 1
Would have written the following documents to writeback index (default is elastalert_status):
silence - {'exponent': 0, 'rule_name': 'nginx rule', '@timestamp': datetime.datetime(2021, 8, 16, 7, 34, 42, 866184, tzinfo=tzutc()), 'until': datetime.datetime(2021, 8, 16, 7, 35, 42, 866174, tzinfo=tzutc())}
elastalert_status - {'rule_name': 'nginx rule', 'endtime': datetime.datetime(2021, 8, 16, 7, 34, 42, 810992, tzinfo=tzutc()), 'starttime': datetime.datetime(2021, 8, 16, 7, 33, 42, 210992, tzinfo=tzutc()), 'matches': 1, 'hits': 5, '@timestamp': datetime.datetime(2021, 8, 16, 7, 34, 42, 868045, tzinfo=tzutc()), 'time_taken': 0.015259981155395508}
Run elastalert with the following command; the output shows that an alert was triggered:
> elastalert --verbose --rule rules/nginx.yaml
1 rules loaded
INFO:elastalert:Starting up
INFO:elastalert:Disabled rules are: []
INFO:elastalert:Sleeping for 59.999839 seconds
INFO:elastalert:Queried rule nginx rule from 2021-08-16 14:54 CST to 2021-08-16 15:39 CST: 7 / 7 hits
INFO:elastalert:HTTP Post alert sent.
INFO:elastalert:Ran nginx rule from 2021-08-16 14:54 CST to 2021-08-16 15:39 CST: 7 query hits (0 already seen), 1 matches, 1 alerts sent
Visit the https://webhook.site page to see the HTTP POST request sent by ElastAlert.
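If an external test site is not reachable from your network, a small local receiver can stand in for it. The following is a minimal sketch using only the Python standard library. The names WebhookHandler and start_receiver are hypothetical, and the payload posted at the end is a simulated example rather than ElastAlert's exact alert body.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed JSON bodies of every POST seen so far


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length) or b"{}"))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the console quiet


def start_receiver(port: int = 0) -> HTTPServer:
    """Start the receiver in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_receiver()
# Simulated alert payload; the field names are illustrative only.
body = json.dumps({"rule_name": "nginx rule", "num_hits": 5}).encode()
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("POST", "/", body=body, headers={"Content-Type": "application/json"})
status = conn.getresponse().status
conn.close()
server.shutdown()
server.server_close()
print(status, received)
```

To use a receiver like this with the rule, it would need to listen on an address that the ElastAlert host can reach, and http_post_url would point at that address.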
Query the elastalert_status index to see ElastAlert's execution records.
GET elastalert_status/_search
# Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "elastalert_status",
"_type" : "_doc",
"_id" : "1SDmTXsBCANUjLff0M1Q",
"_score" : 1.0,
"_source" : {
"match_body" : {
"input" : {
"type" : "log"
},
"agent" : {
"hostname" : "nginx-plus1",
"name" : "nginx-plus1",
"id" : "629442a8-34ab-40db-80a8-16e4fda8dec7",
"ephemeral_id" : "4ee4bd89-cb8e-43fb-9331-476c229a5480",
"type" : "filebeat",
"version" : "7.14.0"
},
"@timestamp" : "2021-08-16T07:34:26.230Z",
"ecs" : {
"version" : "1.10.0"
},
"log" : {
"file" : {
"path" : "/usr/local/nginx/logs/error.log"
},
"offset" : 16740
},
"host" : {
"name" : "nginx-plus1"
},
"message" : "2021/08/16 15:34:22 [error] 4022#0: *39 open() \"/usr/local/nginx/html/xxxxxx\" failed (2: No such file or directory), client: 192.168.1.35, server: , request: \"GET /xxxxxx HTTP/1.1\", host: \"192.168.1.134\"",
"_id" : "zyDiTXsBCANUjLffFM2O",
"_index" : "nginx-2021.08.16",
"_type" : "_doc",
"num_hits" : 7,
"num_matches" : 1
},
"rule_name" : "nginx rule",
"alert_info" : {
"type" : "http_post",
"http_post_webhook_url" : [
"https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36"
]
},
"alert_sent" : true,
"alert_time" : "2021-08-16T07:39:35.185929Z",
"match_time" : "2021-08-16T07:34:26.230Z",
"@timestamp" : "2021-08-16T07:39:37.418536Z"
}
}
]
}
}
Watcher
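Watcher is a feature provided by Elastic for monitoring log data and alerting on it. Watcher is a paid feature; a 30-day trial can be enabled under License Management.
A watch consists of the following 5 parts:
- trigger: defines when or how often the watch runs.
- input: defines the data source, such as an index or the result of an HTTP request. If no input is set, the payload is empty.
- condition: defines the condition under which the actions run. If no condition is set, the actions always run.
- transform (optional): modifies the watch payload.
- actions: defines what to do, for example email, webhook, index, logging, slack, and so on.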
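The way these five parts fit together can be sketched as a small pipeline. This is a conceptual model only, not Watcher's implementation: the function run_watch and its parameters are hypothetical, and one call stands for one firing of the trigger.

```python
from typing import Callable, Optional


def run_watch(
    load_input: Callable[[], dict],
    condition: Optional[Callable[[dict], bool]] = None,
    transform: Optional[Callable[[dict], dict]] = None,
    actions: Optional[dict] = None,
) -> list:
    """Run one trigger cycle and return the names of the actions that fired."""
    payload = load_input()  # input: load data into the payload
    if condition is not None and not condition(payload):
        return []  # condition not met, so no action runs
    if transform is not None:
        payload = transform(payload)  # transform: reshape the payload
    fired = []
    for name, action in (actions or {}).items():
        action(payload)  # actions: for example webhook or email
        fired.append(name)
    return fired


sent = []
fired = run_watch(
    load_input=lambda: {"hits": {"total": 3}},
    condition=lambda p: p["hits"]["total"] > 0,
    actions={"my_webhook": lambda p: sent.append(f"Number of Nginx Error: {p['hits']['total']}")},
)
print(fired, sent)
```

Leaving out the condition makes every cycle run its actions, which mirrors the default behaviour described in the list.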
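Create a watch:
- trigger: runs once every minute.
- input: searches the indices matching the nginx-* wildcard for the keyword error in the message field, covering events from the past 5 minutes on each run.
- condition: if the query returns at least 1 match, trigger the action.
- action: sends an HTTP POST request to the specified URL.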
PUT _watcher/watch/nginx-watcher
{
"trigger": {
"schedule" : {
"interval" : "1m"
}
},
"input": {
"search": {
"request": {
"indices": [
"nginx-*"
],
"body": {
"query": {
"bool": {
"must": {
"match": {
"message": "error"
}
},
"filter": {
"range": {
"@timestamp": {
"from": "{{ctx.trigger.scheduled_time}}||-5m",
"to": "{{ctx.trigger.triggered_time}}"
}
}
}
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gt": 0
}
}
},
"actions": {
"my_webhook": {
"throttle_period": "2m",
"webhook": {
"method": "POST",
"url": "https://webhook.site/2f64f4b3-8b43-488c-b2df-695136079e36",
"body": "Number of Nginx Error: {{ctx.payload.hits.total}}"
}
}
}
}
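The range filter in this watch combines the trigger times with date math: the lower bound is the scheduled time minus 5 minutes and the upper bound is the time the watch actually triggered. The arithmetic can be sketched as follows; search_window is a hypothetical helper and the 120 ms trigger delay is an invented example value.

```python
from datetime import datetime, timedelta, timezone


def search_window(scheduled_time: datetime, triggered_time: datetime,
                  lookback: timedelta = timedelta(minutes=5)):
    """Return the (from, to) bounds used by the range filter."""
    return scheduled_time - lookback, triggered_time


scheduled = datetime(2021, 8, 16, 7, 40, 0, tzinfo=timezone.utc)
triggered = scheduled + timedelta(milliseconds=120)  # hypothetical small delay
start, end = search_window(scheduled, triggered)
print(start.isoformat(), end.isoformat())
```

Because the watch runs every minute but looks back 5 minutes, the same error can satisfy the condition in several consecutive runs. The throttle_period of 2m limits how often the webhook action actually fires.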
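View the watch that was just created:
Send 5 invalid requests within 1 minute.
for i in {1..5}; do curl http://192.168.1.134/xxxxxx -I; done
Check the watch status; it shows that the action was triggered.
The latest webhook event has fired, and its Raw Content matches the body format we defined earlier.
If the watch interval is long, waiting for the next run makes testing slow, so Elasticsearch provides the _execute API. The following command runs the watch immediately.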
PUT _watcher/watch/nginx-watcher/_execute